
3 basic approaches in Bag of Words which are better than Word Embeddings

Photo: https://pixabay.com/en/hong-kong-china-city-chinese-asian-383963/

Nowadays, everyone is talking about Word (or Character, Sentence, Document) Embeddings. Is Bag of Words still worth using? Should we apply embeddings in every scenario?

After reading this article, you will know:

  • Why do people say that Word Embeddings are the silver bullet?
  • When does Bag of Words win over Word Embeddings?
  • 3 basic approaches in Bag of Words
  • How can we build Bag of Words in a few lines?

Why do people say that Word Embeddings are the silver bullet?

Photo: https://pixabay.com/en/books-stack-book-store-1163695/

In the current state of the art of the NLP field, embeddings are the go-to way to solve text-related problems and they outperform Bag of Words (BoW). Indeed, BoW has limitations such as large feature dimensions and sparse representations. For word embeddings, you may check out my previous post.

Should we still use BoW? BoW may actually be the better choice in some scenarios.

When does Bag of Words win over Word Embeddings?

Photo: https://www.ebay.co.uk/itm/Custom-Tote-Bag-Friday-My-Second-Favorite-F-Word-Gift-For-Her-Gift-For-Him-/122974487851

You may still consider using BoW rather than Word Embeddings in the following situations:

  1. Building a baseline model. Using scikit-learn, it takes just a few lines of code to build a model. Later on, you can use Deep Learning to improve on it a bit.
  2. If your dataset is small and the context is domain-specific, BoW may work better than Word Embeddings. A very domain-specific context means you cannot find corresponding vectors in pre-trained word embedding models (GloVe, fastText, etc.).

How can we build Bag of Words in a few lines?

There are 3 simple ways to build a BoW model using traditional, powerful ML libraries.

Count Occurrence

Photo: https://pixabay.com/en/home-money-euro-calculator-finance-366927/

Counting word occurrences. The reasoning behind this approach is that a keyword or important signal will occur again and again, so the number of occurrences represents the importance of a word: more frequency means more importance.

doc = "In the-state-of-art of the NLP field, Embedding is the \
success way to resolve text related problem and outperform \
Bag of Words ( BoW ). Indeed, BoW introduced limitations \
large feature dimension, sparse representation etc."
count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([doc])
count_occur_df = pd.DataFrame(
    (count, word) for word, count in
     zip(count_occurs.toarray().tolist()[0], 
    count_vec.get_feature_names()))
count_occur_df.columns = ['Word', 'Count']
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()

Output

Word: "of", Occurrence: 3
Word: "bow", Occurrence: 2
Word: "way", Occurrence: 1

Normalized Count Occurrence

If you think that extremely high frequencies may dominate the result and cause model bias, normalization can be applied to the pipeline easily.

doc = "In the-state-of-art of the NLP field, Embedding is the \
success way to resolve text related problem and outperform \
Bag of Words ( BoW ). Indeed, BoW introduced limitations \
large feature dimension, sparse representation etc."
tfidf_vec = TfidfVectorizer()
tfidf_count_occurs = tfidf_vec.fit_transform([doc])
tfidf_count_occur_df = pd.DataFrame(
    (count, word) for word, count in zip(
    tfidf_count_occurs.toarray().tolist()[0],   
    tfidf_vec.get_feature_names()))
tfidf_count_occur_df.columns = ['Word', 'Count']
tfidf_count_occur_df.sort_values('Count', ascending=False, inplace=True)
tfidf_count_occur_df.head()

Output

Word: "of", Occurrence: 0.4286
Word: "bow", Occurrence: 0.4286
Word: "way", Occurrence: 0.1429

Term Frequency-Inverse Document Frequency (TF-IDF)

Photo: http://mropengate.blogspot.com/2016/04/tf-idf-in-r-language.html

TF-IDF takes another approach, based on the belief that high frequency may not provide much information gain. In other words, rare words contribute more weight to the model.

A word's importance increases with the number of occurrences within the same document (i.e. the same training record). On the other hand, its importance decreases the more it appears across the corpus (i.e. other training records).

doc = "In the-state-of-art of the NLP field, Embedding is the \
success way to resolve text related problem and outperform \
Bag of Words ( BoW ). Indeed, BoW introduced limitations \
large feature dimension, sparse representation etc."
norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([doc])
norm_count_occur_df = pd.DataFrame(
    (count, word) for word, count in zip(
    norm_count_occurs.toarray().tolist()[0], 
    norm_count_vec.get_feature_names()))
norm_count_occur_df.columns = ['Word', 'Count']
norm_count_occur_df.sort_values(
    'Count', ascending=False, inplace=True)
norm_count_occur_df.head()

Output (the values are exactly the same as the normalized count occurrence, because the demo code only includes one document)

Word: "of", Occurrence: 0.4286
Word: "bow", Occurrence: 0.4286
Word: "way", Occurrence: 0.1429

Code

This sample code compares Count Occurrence, Normalized Count Occurrence and TF-IDF.

Here is a sample function that builds a model using the different vectorization methods:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_model(mode):
    # Intentionally use default parameters for the showcase
    if mode == 'count':
        vect = CountVectorizer()
    elif mode == 'tf':
        vect = TfidfVectorizer(use_idf=False, norm='l2')
    elif mode == 'tfidf':
        vect = TfidfVectorizer()
    else:
        raise ValueError('Mode should be count, tf or tfidf')

    return Pipeline([
        ('vect', vect),
        ('clf', LogisticRegression(solver='newton-cg', n_jobs=-1))
    ])

Here is another sample function to build an end-to-end pipeline:

from sklearn.model_selection import KFold, cross_val_score

def pipeline(x_raw, y_raw, mode):
    # preprocess_x / preprocess_y are the author's own preprocessing helpers
    x = preprocess_x(x_raw)
    y = preprocess_y(y_raw)

    model_pipeline = build_model(mode)
    cv = KFold(n_splits=10, shuffle=True)

    scores = cross_val_score(
        model_pipeline, x, y, cv=cv, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f)" % (
        scores.mean(), scores.std() * 2))

    return model_pipeline
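
The snippets above and below rely on preprocess_x and preprocess_y, the author's own preprocessing helpers (available in his GitHub repo). A minimal stand-in, assuming the inputs are already an iterable of raw text strings and an iterable of labels (these trivial implementations are mine, for illustration only):

def preprocess_x(texts):
    # Hypothetical stand-in for the author's preprocessing helper;
    # the real one performs steps such as stop word removal and lowercasing.
    return texts

def preprocess_y(labels):
    # Hypothetical stand-in: assume the labels are already encoded.
    return labels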

Let's check the size of the vocabulary we need to handle:

x = preprocess_x(x_train)
y = y_train

model_pipeline = build_model(mode='count')
model_pipeline.fit(x, y)
print('Number of Vocabulary: %d' % len(
    model_pipeline.named_steps['vect'].get_feature_names()))

Output

Number of Vocabulary: 130107

Invoke the pipeline by passing "count" (Count Occurrence), "tf" (Normalized Count Occurrence) and "tfidf" (TF-IDF):

print('Using Count Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='count')
print('Using TF Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='tf')
print('Using TF-IDF Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='tfidf')

The results:

Using Count Vectorizer------
Accuracy: 0.8892 (+/- 0.0198)
Using TF Vectorizer------
Accuracy: 0.8071 (+/- 0.0110)
Using TF-IDF Vectorizer------
Accuracy: 0.8917 (+/- 0.0072)

Conclusion

You can find all of the code on GitHub.

From previous experience, I tried to tackle the problem of classifying a product category given a short description. For example, given "Fresh Apple", the expected category is "Fruit". I was already able to reach 80+% accuracy using the count occurrence approach alone.

In this case, the number of words per training record was small (from 2 to 10 words), so it may not be a good idea to use Word Embeddings, as there are not many neighboring words for training the vectors.

On the other hand, scikit-learn provides other parameters to further tune the model input. You may want to take a look at the following features (see the sketch after this list):

  • ngram_range: Rather than using only single words, n-grams can be defined as well.
  • binary: Besides counting occurrences, a binary (presence/absence) representation can be chosen.
  • max_features: Instead of using all words, a maximum number of words can be chosen to reduce model complexity and size.
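
For example, a vectorizer that also considers bigrams, uses binary indicators and caps the vocabulary could be configured like this (the parameter names are real CountVectorizer options; the specific values are only illustrative, not from the original article):

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    binary=True,           # 0/1 presence instead of raw counts
    max_features=10000)    # keep only the 10,000 most frequent features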

Also, some preprocessing steps can be executed within the library itself rather than handled by yourself, for example stop word removal and lowercasing, as sketched below. For better flexibility, I use my own code for the preprocessing steps.
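
A quick sketch of those built-in options (stop_words and lowercase are real CountVectorizer/TfidfVectorizer parameters; whether to use them is up to you):

from sklearn.feature_extraction.text import CountVectorizer

# Built-in English stop word removal and lowercasing (lowercase=True is the default)
vect = CountVectorizer(stop_words='english', lowercase=True)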

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics.

Visit my blog at http://medium.com/@makcedward/

Connect with me on https://www.linkedin.com/in/edwardma1026

Explore my code at https://github.com/makcedward

Check out my Kaggle kernels at https://www.kaggle.com/makcedward

