
glove-android: Using GloVe Word Embeddings for NLP In Android

source link: https://proandroiddev.com/glove-android-using-glove-word-embeddings-for-nlp-in-android-b7e412cf5de6

Power of word embeddings, in Android!


A glimpse of the demo app using glove-android. The first and the third images (from L -> R) depict the ‘compare words’ feature which computes cosine similarity between two words. The second image shows embedding generation in action.

glove-android is an Android library that provides a clean interface to GloVe word embeddings, which have been quite popular in NLP applications. Word embeddings can be used to measure the semantic similarity between two words, as similar words would have embeddings (high-dimensional vectors) closer to each other.

Currently, the only supported embeddings are 50D GloVe vectors trained on the Wikipedia corpus. This story outlines how developers can add glove-android to their Android projects, how it works internally, and its limitations. Here’s the GitHub repo ->


What are GloVe word embeddings?

Word embeddings are high-dimensional vectors generated for each word present in a huge text corpus. These vectors are produced such that the vectors of two semantically similar words lie in proximity to each other in the embedding space.

To train the GloVe model, a co-occurrence matrix is used, whose (i, j)-th entry counts how often the i-th word and the j-th word occur together within a context window.
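The idea of a co-occurrence matrix can be sketched with a toy corpus. The corpus, window size, and counting scheme below are simplified for illustration; the real GloVe model uses a much larger window with distance-based weighting over billions of tokens.

```python
from collections import Counter

# Toy corpus; GloVe's real training data is billions of tokens.
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
]

# Count co-occurrences within a symmetric window of 1 word.
cooccur = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens)):
        for j in range(max(0, i - 1), min(len(tokens), i + 2)):
            if i != j:
                cooccur[(tokens[i], tokens[j])] += 1

print(cooccur[("king", "rules")])   # 1
print(cooccur[("rules", "the")])    # 2
```

Entries with high counts, like ("rules", "the"), mark word pairs that frequently appear together; GloVe fits embeddings so that such pairs end up close in the embedding space.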


An illustration of word embeddings in the embedding space. Words ‘king’ and ‘queen’ are related contextually and hence point (nearly) in the same direction, establishing high semantic similarity. ‘Ice’ is an unrelated word and does not lie in the proximity of the other two vectors.

The GloVe model is trained in such a way that similar words, i.e. those with high co-occurrence, lie near each other. We can calculate the cosine of the angle between two embeddings: a value close to 1 means the words are semantically related, while a value near -1 indicates a high level of disjointness.

Adding glove-android to an existing project

Developers can use the AAR of the library, found in the Releases section of the repository. Download the AAR from the latest release and place it in the app/libs folder of the app.

1*fMTm9XbBIYUux96TUF7BMw.png

glove-android.aar is placed in app/libs, which houses the app’s private libraries.

Next, we need to inform Gradle about this AAR as it has to be included in the build. In the module-level build.gradle file, specifically, in the dependencies block, add,

dependencies {
    ...
    implementation files('libs/glove-android.aar')
    ...
}

Sync the Gradle files and build the project. You should be ready to use glove-android in your project now. If you’re facing any issues with the installation, do open an issue on the repository.

Using glove-android with Kotlin

The word embeddings are loaded from a file bundled within the library’s package, so there are no API calls to fetch them. The embeddings are loaded from an H5 file, which takes some time due to its large size (~40 MB). To load the embeddings into memory, we use the GloVe.loadEmbeddings method, which is a suspend function and hence needs a CoroutineScope for execution.

The method takes a callback of type (GloVeEmbeddings) -> Unit, which receives an object of class GloVeEmbeddings through which developers can access the word embeddings synchronously.

class MainActivity : ComponentActivity() {

    private var gloveEmbeddings : GloVe.GloVeEmbeddings? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        setContent {
            // Activity UI here
        }

        // GloVe.loadEmbeddings is a suspendable function.
        // We need a coroutine scope to handle its execution
        // off the main thread.
        CoroutineScope( Dispatchers.IO ).launch {
            GloVe.loadEmbeddings { it ->
                gloveEmbeddings = it
            }
        }

    }

}

Next, we can use the gloveEmbeddings object to retrieve embeddings for any word,

val embedding1 = gloveEmbeddings!!.getEmbedding( "king" )
val embedding2 = gloveEmbeddings!!.getEmbedding( "queen" )
if( embedding1.isNotEmpty() && embedding2.isNotEmpty() ) {
    result = GloVe.compare( embedding1 , embedding2 ).toString()
}

If an embedding isn’t found, the getEmbedding method returns an empty float array, hence the embedding1.isNotEmpty() check.

GloVe.compare takes in two embeddings which are FloatArray and returns the cosine similarity, which is mathematically expressed as,

cos(θ) = (A · B) / (‖A‖ ‖B‖)

where A · B is the dot product of the two embedding vectors and ‖A‖, ‖B‖ are their magnitudes.
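As a sketch, cosine similarity can be computed as follows. The three-dimensional vectors here are invented for illustration; real GloVe embeddings are 50-dimensional.

```python
import math

# Cosine similarity between two equal-length vectors, mirroring
# what a compare function over embeddings computes.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings: "king" and "queen" point in similar
# directions, "ice" points elsewhere.
king  = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
ice   = [-0.5, 0.1, 0.9]

print(round(cosine_similarity(king, queen), 3))  # close to 1
print(round(cosine_similarity(king, ice), 3))    # negative
```

The similar pair scores near 1 while the unrelated pair scores below 0, matching the interpretation of cosine similarity described above.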

Limitation — Increase in app’s package size

A limitation of glove-android is that it increases the host app’s package size considerably. This is because the 50D GloVe embeddings are packaged into the library and hence become part of the app itself. glove-android also bundles Chaquopy as a dependency to read H5 files, leading to a further increase in the app’s size.

How does glove-android work internally?

A glance at the official website of GloVe, where the embeddings are available for download as text files, reveals how large these files are. The embeddings used by glove-android, 50D vectors (the smallest dimension available) trained on the Wikipedia 2014 dataset containing 6 billion tokens, come in a file of 167 MB, which would be added as-is to the app’s assets. Apart from file compression, constant-time retrieval is also needed, as searching linearly through the vocabulary would take a lot of time. To solve these problems, glove-android uses the following techniques:

  • Storing the embeddings in H5 format as multi-dimensional arrays
  • Reduction of floating point precision: from 32-bit precision to 16-bit precision
  • Storing the word-index mapping as a hash-table for near-constant time retrieval. Here ‘index’ refers to the position of the embedding in the multi-dimensional array.
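The word-index lookup in the last point can be sketched in miniature. The words and two-dimensional rows below are hypothetical; the real mapping covers the full GloVe vocabulary and 50-dimensional rows.

```python
# Hypothetical miniature of the word -> row-index mapping:
# a dict gives near-constant-time lookup, and the index
# selects a row in the embedding matrix.
words = {"king": 0, "queen": 1, "ice": 2}
embeddings = [
    [0.8, 0.6],   # row 0: "king"
    [0.7, 0.7],   # row 1: "queen"
    [-0.5, 0.9],  # row 2: "ice"
]

def get_embedding(word):
    idx = words.get(word)  # O(1) average-case hash lookup
    if idx is None:
        return []          # empty result for unknown words
    return embeddings[idx]

print(get_embedding("queen"))  # [0.7, 0.7]
print(get_embedding("zzz"))    # []
```

Returning an empty list for unknown words mirrors the library’s behavior of returning an empty float array from getEmbedding.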

The H5 format is a highly efficient file format for storing multi-dimensional arrays. Further, the precision of the embeddings is reduced to float16, which results in a much smaller file size. This might affect performance slightly, as precision is reduced.
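The effect of the precision reduction is easy to demonstrate: casting float32 to float16 halves the storage. The matrix shape below is tiny for illustration; the real matrix has one 50-dimensional row per vocabulary word.

```python
import numpy as np

# Illustrative: halving storage by casting embeddings to float16.
emb32 = np.random.rand(1000, 50).astype(np.float32)
emb16 = emb32.astype(np.float16)

print(emb32.nbytes)  # 200000 bytes
print(emb16.nbytes)  # 100000 bytes

# The rounding error introduced by float16 is small relative
# to typical embedding values.
print(np.abs(emb32 - emb16.astype(np.float32)).max())
```

This is the same trade-off the library makes: a much smaller file at the cost of a small loss of precision in the similarity scores.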

The word embeddings are stored in the H5 format, but how do we know that the embedding for a particular word lies at a specific index? We need to maintain a word-index mapping, which is stored as a dict in Python. Given a word, which is the ‘key’, we look up the corresponding ‘value’, which is the index of the embedding in the 2D array stored in the H5 file. This technique provides efficient storage and near-constant time retrieval.

import h5py
import numpy as np
import pickle

glove_file = open( "glove.6B\glove.6B\glove.6B.50d.txt" , "r" , encoding="utf-8" )
words = {}
embeddings = []
count = 0
for line in glove_file:
    parts = line.strip().split()
    word = parts[0]
    embedding = [ float(parts[i]) for i in range( 1 , 51 ) ]
    words[ word ] = count
    embeddings.append( embedding )
    count += 1
print( "Words processed" , count )

embeddings = np.array( embeddings )
hf = h5py.File( "glove_vectors_50d.h5" , "w" )
hf.create_dataset( "glove_vectors" , data=embeddings.astype( 'float16' ) )
hf.close()

with open( "glove_words_50d.pkl" , "wb" ) as file:
    pickle.dump( words , file )

There’s another Python script which reads the H5 file and the pickled dict and is executed in the Android app using Chaquopy.
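A sketch of what that read-side script might look like is shown below. The actual script bundled with glove-android may differ; only the file and dataset names follow the conversion script above. To keep the sketch self-contained, it first writes a tiny H5 file and pickled dict in the same layout, then reads them back.

```python
import os
import pickle
import tempfile

import h5py
import numpy as np

# Set-up: create a tiny H5 file and pickled word->index dict in the
# same layout the conversion script produces (3 fake words, 50 dims).
tmp = tempfile.mkdtemp()
h5_path = os.path.join(tmp, "glove_vectors_50d.h5")
pkl_path = os.path.join(tmp, "glove_words_50d.pkl")

vectors = np.random.rand(3, 50).astype("float16")
with h5py.File(h5_path, "w") as hf:
    hf.create_dataset("glove_vectors", data=vectors)
with open(pkl_path, "wb") as f:
    pickle.dump({"king": 0, "queen": 1, "ice": 2}, f)

# Read side: load the word -> index dict once, then pull single
# rows out of the H5 dataset on demand.
with open(pkl_path, "rb") as f:
    word_index = pickle.load(f)
hf = h5py.File(h5_path, "r")
dataset = hf["glove_vectors"]

def get_embedding(word):
    idx = word_index.get(word)
    if idx is None:
        # Empty array for unknown words, mirroring the Kotlin API.
        return np.empty(0, dtype="float16")
    return dataset[idx]

print(get_embedding("queen").shape)   # (50,)
print(get_embedding("unknown").size)  # 0
```

Because only the requested row is read from the dataset, the full matrix never needs to be decompressed into a Python list, which keeps per-word retrieval fast.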

Chaquopy is an Android library used to run Python scripts in Android apps. Here’s a blog, if you wish to learn more.

Hope you’ll try glove-android

glove-android is a tiny component which can add a great feature to Android apps. I hope you’ll try it in your projects and share the feedback on the Issues or Discussions page on GitHub. Thanks for reading, and have a nice day ahead!
