Sentence embeddings and caching huggingface models in github actions

For updates on Crime De-Coder:

AltAc Newsletter posts up to 4 are posted. Get in touch if you are interested in recruiting (nearly 100 people signed up so far, most with Phds, interested in positions)
I have a post on what distinguishes my crime analysis training from others – tech expertise plus I make it customized.

This is actually more born out from demand interacting with crime analysis groups – it doesn’t really make sense for me to swoop in and say “do X/Y/Z, here is my report”. My experience doing that with PDs is not good – my recommendations often do not get implemented. So it is important to upskill your current analysts to be able to conduct more rigorous analysis over time.

Some things it is better to have an external person do the report (such as program evaluation, cost-benefit analysis, or racial bias metrics). But things like generating hotspots and chronic offender lists, those should be something your crime analysis unit learns how to do.

You can of course hire me to do those things, but if you don’t train your local analysts to take over the job, you have to perpetually hire me to do that. Which may make sense for small departments without an internal crime analysis unit. But for large agencies you should be using your crime analysts to do reports like that.

This post, I also wanted to share some work on NLP. I have not been immune at work given all the craze. I am more impressed with vector embeddings and semantic search though than I am with GenAI applications. GenAI I think will be very useful, like a more advanced auto-complete, but I don’t think it will put us all out of work anytime soon.

Vector embeddings, for those not familiar, are taking a text input and turning it into numbers. You can then do nearness calculations with the numbers. So say you had plain text crime narrative notes on modus operandi, and wanted to search “breaking car window and stealing purse” – you don’t want to do an exact search on that phrase, but a search for text that is similar. So the vector similarity search may return a narrative that is “break truck back-window with pipe, take backpack”. Not exactly the same, but semantically very similar.

Sentencetransformers makes this supersimple (no need to use external APIs for this).

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Base query
query = ['breaking car window and stealing purse']

# Example MO
mo = ['break truck back window with pipe, steal backpack',
      'pushing in a window AC unit, steal TV and sneakers',
      'steal car from cloned fab, joyride']

#Compute embedding for both lists
qemb = model.encode(query, convert_to_tensor=True)
moemb = model.encode(mo, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(moemb, qemb)
print(cosine_scores)

ADCreHdvxF7fxcCFgBHXttBB8JF9nMnhhi5fGwkY7pTIwtglpWDjxVxG8ea2T1hhvfU_QXXfa09xAQYwIcqnzoS_BMy6V5-H6i0wxE-I0i3Fo1U2Y5vIePCUb-Wxmk0_kb8O8OY6zsNgZNGdGGeq-J9R1lci=w854-h433-s-no-gm?authuser=0

So you have your query, and often people just return the top-k results, since to know “how close is reasonably close” is difficult in practice (google pretty much always returns some results, even if they are off the mark!).

I am quite impressed with the general models for this (even very idiosyncratic documents and jargon it tends to do quite well). But if you want to embed specifically trained models on different tasks, I often turn to simpletransformers. Here is a model based on clinical case notes.

from simpletransformers.language_representation import RepresentationModel

model = RepresentationModel("bert", "medicalai/ClinicalBERT")
sent1 = ["Procedure X-ray of wrist"] # needs to be a list
wv1 = model.encode_sentences(sent1, combine_strategy="mean")

For another example, I shared a github repo, hugging_cache, where I did some work on how to cache huggingface models in a github actions script. This is useful for unit tests, so you don’t need to redownload the model everytime.

Github cache size for free accounts is 10 gigs (and you need the environment itself, which is a few gigs). So huggingface models up to say around 4 (maybe 5) gigs will work out OK in this workflow. So not quite up to par with the big llama models, but plenty large enough for these smaller embedding models.

It is not very difficult to build a local, in-memory vector search database. Which will be sufficient for many individuals. So a crime analyst could build a local search engine for use – redo vector DB on a batch job, then just have a tool/server to do the local lookup. (Or just use a local sqlite database is probably sufficient for a half dozen analysts/detectives may use this tool every now and then.)

Sentence embeddings and caching huggingface models in github actions

Sentence embeddings and caching huggingface models in github actions

Recommend

How to Build Elite Technical Teams Internally with Cybersecurity Training

Closed-loop recycling of organic flexible electronic devices paves way for susta...

由北向南

Here's How Fast Every Major Electric Car Actually Charges According To Independe...

促行业高质量发展，大董、天财商龙等发布中餐数字化人才培养体系

LUG 小聚：Git 简介与使用心得分享

neovim入门指南(三)：LSP配置(上)

水底

Ruby 3.4 to get the 'it' factor

Spring Load-Time Weaving

About Joyk