
3 basic approaches in Bag of Words which are better than Word Embeddings

Photo: https://pixabay.com/en/hong-kong-china-city-chinese-asian-383963/

Nowadays, everyone is talking about Word (or Character, Sentence, Document) Embeddings. Is Bag of Words still worth using? Should we apply embeddings in every scenario?

After reading this article, you will know:

  • Why do people say that Word Embeddings are the silver bullet?
  • When does Bag of Words win over Word Embeddings?
  • 3 basic approaches in Bag of Words
  • How can we build Bag of Words in a few lines?

Why do people say that Word Embeddings are the silver bullet?

Photo: https://pixabay.com/en/books-stack-book-store-1163695/

In the current state of the art of the NLP field, embeddings are the go-to way to solve text-related problems and they outperform Bag of Words (BoW). Indeed, BoW has limitations such as large feature dimensions and sparse representations. For word embeddings, you may check out my previous post.

Should we still use BoW? BoW may actually be the better choice in some scenarios.

When does Bag of Words win over Word Embeddings?

Photo: https://www.ebay.co.uk/itm/Custom-Tote-Bag-Friday-My-Second-Favorite-F-Word-Gift-For-Her-Gift-For-Him-/122974487851

You may still consider using BoW rather than Word Embeddings in the following situations:

  1. Building a baseline model. Using scikit-learn, it takes just a few lines of code to build a model. Later on, you can use Deep Learning to improve on it a bit.
  2. If your dataset is small and the context is domain-specific, BoW may work better than Word Embeddings. A very domain-specific context means you cannot find corresponding vectors in pre-trained word embedding models (GloVe, fastText, etc.).

How can we build Bag of Words in a few lines?

There are 3 simple ways to build a BoW model using traditional, powerful ML libraries.

Count Occurrence

Photo: https://pixabay.com/en/home-money-euro-calculator-finance-366927/

Counting word occurrences. The reasoning behind this approach is that a keyword or important signal will occur again and again, so the number of occurrences represents the importance of a word: more frequency means more importance.

doc = "In the-state-of-art of the NLP field, Embedding is the \
success way to resolve text related problem and outperform \
Bag of Words ( BoW ). Indeed, BoW introduced limitations \
large feature dimension, sparse representation etc."
count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([doc])
count_occur_df = pd.DataFrame(
    (count, word) for word, count in
     zip(count_occurs.toarray().tolist()[0], 
    count_vec.get_feature_names()))
count_occur_df.columns = ['Word', 'Count']
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()

Output

Word: "of", Occurrence: 3
Word: "bow", Occurrence: 2
Word: "way", Occurrence: 1

Normalized Count Occurrence

If you think that extremely high frequencies may dominate the result and cause model bias, normalization can be applied to the pipeline easily.

doc = "In the-state-of-art of the NLP field, Embedding is the \
success way to resolve text related problem and outperform \
Bag of Words ( BoW ). Indeed, BoW introduced limitations \
large feature dimension, sparse representation etc."
tfidf_vec = TfidfVectorizer()
tfidf_count_occurs = tfidf_vec.fit_transform([doc])
tfidf_count_occur_df = pd.DataFrame(
    (count, word) for word, count in zip(
    tfidf_count_occurs.toarray().tolist()[0],   
    tfidf_vec.get_feature_names()))
tfidf_count_occur_df.columns = ['Word', 'Count']
tfidf_count_occur_df.sort_values('Count', ascending=False, inplace=True)
tfidf_count_occur_df.head()

Output

Word: "of", Occurrence: 0.4286
Word: "bow", Occurrence: 0.4286
Word: "way", Occurrence: 0.1429

Term Frequency-Inverse Document Frequency (TF-IDF)

Photo: http://mropengate.blogspot.com/2016/04/tf-idf-in-r-language.html

TF-IDF takes another approach, based on the belief that high frequency may not provide much information gain. In other words, rare words contribute more weight to the model.

A word's importance increases with the number of occurrences within the same document (i.e. the same training record). On the other hand, its importance decreases the more it appears across the corpus (i.e. other training records).

doc = "In the-state-of-art of the NLP field, Embedding is the \
success way to resolve text related problem and outperform \
Bag of Words ( BoW ). Indeed, BoW introduced limitations \
large feature dimension, sparse representation etc."
norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([doc])
norm_count_occur_df = pd.DataFrame(
    (count, word) for word, count in zip(
    norm_count_occurs.toarray().tolist()[0], 
    norm_count_vec.get_feature_names()))
norm_count_occur_df.columns = ['Word', 'Count']
norm_count_occur_df.sort_values(
    'Count', ascending=False, inplace=True)
norm_count_occur_df.head()

Output (the values are exactly the same as the normalized count occurrence, because the demo code only includes one document)

Word: "of", Occurrence: 0.4286
Word: "bow", Occurrence: 0.4286
Word: "way", Occurrence: 0.1429

Code

This sample code compares Count Occurrence, Normalized Count Occurrence and TF-IDF.

Here is a sample function that builds a model using the different vectorization methods:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_model(mode):
    # Intentionally use default parameters for the showcase
    if mode == 'count':
        vect = CountVectorizer()
    elif mode == 'tf':
        vect = TfidfVectorizer(use_idf=False, norm='l2')
    elif mode == 'tfidf':
        vect = TfidfVectorizer()
    else:
        raise ValueError('Mode should be count, tf or tfidf')

    return Pipeline([
        ('vect', vect),
        ('clf', LogisticRegression(solver='newton-cg', n_jobs=-1))
    ])

Here is another sample function to build an end-to-end pipeline:

from sklearn.model_selection import KFold, cross_val_score

def pipeline(x_raw, y_raw, mode):
    # preprocess_x / preprocess_y are the author's own preprocessing helpers
    x = preprocess_x(x_raw)
    y = preprocess_y(y_raw)

    model_pipeline = build_model(mode)
    cv = KFold(n_splits=10, shuffle=True)

    scores = cross_val_score(
        model_pipeline, x, y, cv=cv, scoring='accuracy')
    print("Accuracy: %0.4f (+/- %0.4f)" % (
        scores.mean(), scores.std() * 2))

    return model_pipeline
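
The snippets above and below rely on preprocess_x and preprocess_y, the author's own preprocessing helpers (available in his GitHub repo). A minimal stand-in, assuming the inputs are already an iterable of raw text strings and an iterable of labels (these trivial implementations are mine, for illustration only):

def preprocess_x(texts):
    # Hypothetical stand-in for the author's preprocessing helper;
    # the real one performs steps such as stop word removal and lowercasing.
    return texts

def preprocess_y(labels):
    # Hypothetical stand-in: assume the labels are already encoded.
    return labels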

Let's check the size of the vocabulary we need to handle:

x = preprocess_x(x_train)
y = y_train

model_pipeline = build_model(mode='count')
model_pipeline.fit(x, y)
print('Number of Vocabulary: %d' % len(
    model_pipeline.named_steps['vect'].get_feature_names()))

Output

Number of Vocabulary: 130107

Invoke the pipeline by passing "count" (Count Occurrence), "tf" (Normalized Count Occurrence) and "tfidf" (TF-IDF):

print('Using Count Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='count')
print('Using TF Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='tf')
print('Using TF-IDF Vectorizer------')
model_pipeline = pipeline(x_train, y_train, mode='tfidf')

The results:

Using Count Vectorizer------
Accuracy: 0.8892 (+/- 0.0198)
Using TF Vectorizer------
Accuracy: 0.8071 (+/- 0.0110)
Using TF-IDF Vectorizer------
Accuracy: 0.8917 (+/- 0.0072)

Conclusion

You can find all of the code on GitHub.

From previous experience, I tried to tackle the problem of classifying a product category given a short description. For example, given "Fresh Apple", the expected category is "Fruit". I was already able to reach 80+% accuracy using the count occurrence approach alone.

In this case, the number of words per training record was small (from 2 to 10 words), so it may not be a good idea to use Word Embeddings, as there are not many neighboring words for training the vectors.

On the other hand, scikit-learn provides other parameters to further tune the model input. You may want to take a look at the following features (see the sketch after this list):

  • ngram_range: Rather than using only single words, n-grams can be defined as well.
  • binary: Besides counting occurrences, a binary (presence/absence) representation can be chosen.
  • max_features: Instead of using all words, a maximum number of words can be chosen to reduce model complexity and size.
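
For example, a vectorizer that also considers bigrams, uses binary indicators and caps the vocabulary could be configured like this (the parameter names are real CountVectorizer options; the specific values are only illustrative, not from the original article):

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    binary=True,           # 0/1 presence instead of raw counts
    max_features=10000)    # keep only the 10,000 most frequent features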

Also, some preprocessing steps can be executed within the library itself rather than handled by yourself, for example stop word removal and lowercasing, as sketched below. For better flexibility, I use my own code for the preprocessing steps.
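
A quick sketch of those built-in options (stop_words and lowercase are real CountVectorizer/TfidfVectorizer parameters; whether to use them is up to you):

from sklearn.feature_extraction.text import CountVectorizer

# Built-in English stop word removal and lowercasing (lowercase=True is the default)
vect = CountVectorizer(stop_words='english', lowercase=True)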

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics.

Visit my blog at http://medium.com/@makcedward/

Connect with me on https://www.linkedin.com/in/edwardma1026

Explore my code at https://github.com/makcedward

Check out my Kaggle kernels at https://www.kaggle.com/makcedward

