3 basic approaches in Bag of Words which are better than Word Embeddings
source link: https://www.tuicool.com/articles/hit/bM7Nvir
Nowadays, everyone is talking about Word (or Character, Sentence, Document) Embeddings. Is Bag of Words still worth using? Should we apply embeddings in every scenario?
After reading this article, you will know:
- Why do people say that Word Embeddings are the silver bullet?
- When does Bag of Words win over Word Embeddings?
- 3 basic approaches in Bag of Words
- How can we build Bag of Words in a few lines?
Why do some people say that Word Embeddings are the silver bullet?
In state-of-the-art NLP, embeddings are the successful way to solve text-related problems, and they outperform Bag of Words (BoW). Indeed, BoW has limitations such as a large feature dimension and a sparse representation. For word embeddings, you may check out my previous post.
Should we still use BoW? In some scenarios, BoW may be the better choice.
When does Bag of Words win over Word Embeddings?
You may still consider using BoW rather than Word Embeddings in the following situations:
- Building a baseline model. With scikit-learn, it takes just a few lines of code to build a model. Later on, you can use Deep Learning to beat it.
- If your dataset is small and the context is domain specific, BoW may work better than Word Embeddings. When the context is very domain specific, you may not find the corresponding vectors in pre-trained word embedding models (GloVe, fastText, etc.).
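As a rough illustration of the first point, a BoW baseline can be as short as the sketch below. The toy texts and labels are made up for illustration only, and LogisticRegression is just one reasonable choice of classifier, not a recommendation from this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
texts = ["fresh apple", "green apple", "ripe banana",
         "fresh milk", "cheddar cheese", "greek yogurt"]
labels = ["Fruit", "Fruit", "Fruit", "Dairy", "Dairy", "Dairy"]

# Vectorizer and classifier chained into a single estimator
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

# "red" was never seen during training; only "apple" contributes
prediction = baseline.predict(["red apple"])[0]
print(prediction)
```

Words that were not seen during training are simply ignored by the vectorizer, so the prediction here rests on "apple" alone.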
How can we build Bag of Words in a few lines?
There are 3 simple ways to build a BoW model using traditional, powerful ML libraries.
Count Occurrence
Count word occurrences. The reasoning behind this approach is that a keyword or an important signal will occur again and again, so the number of occurrences represents the importance of a word. Higher frequency means more importance.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    doc = ("In the-state-of-art of the NLP field, Embedding is the "
           "success way to resolve text related problem and outperform "
           "Bag of Words ( BoW ). Indeed, BoW introduced limitations "
           "large feature dimension, sparse representation etc.")

    # Learn the vocabulary and count each word in the single document
    count_vec = CountVectorizer()
    count_occurs = count_vec.fit_transform([doc])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    count_occur_df = pd.DataFrame({
        'Word': count_vec.get_feature_names_out(),
        'Count': count_occurs.toarray()[0],
    })
    count_occur_df.sort_values('Count', ascending=False, inplace=True)
    count_occur_df.head()
Output
    Word: "of", Occurrence: 3
    Word: "bow", Occurrence: 2
    Word: "way", Occurrence: 1
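To make the counting step concrete, the sketch below reproduces these counts without scikit-learn. It assumes a simplified tokenizer (lowercase the text, then keep tokens of two or more word characters), which is meant to approximate CountVectorizer's default behavior rather than replace it.

```python
import re
from collections import Counter

doc = ("In the-state-of-art of the NLP field, Embedding is the "
       "success way to resolve text related problem and outperform "
       "Bag of Words ( BoW ). Indeed, BoW introduced limitations "
       "large feature dimension, sparse representation etc.")

# Lowercase, then keep tokens made of at least two word characters
tokens = re.findall(r"\b\w\w+\b", doc.lower())
counts = Counter(tokens)

print(counts["of"], counts["bow"], counts["way"])  # 3 2 1
```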
Normalized Count Occurrence
If you think that extremely high frequencies may dominate the result and bias the model, normalization can easily be added to the pipeline.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    doc = ("In the-state-of-art of the NLP field, Embedding is the "
           "success way to resolve text related problem and outperform "
           "Bag of Words ( BoW ). Indeed, BoW introduced limitations "
           "large feature dimension, sparse representation etc.")

    # use_idf=False keeps plain term counts; norm='l2' scales the vector to unit length
    norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
    norm_count_occurs = norm_count_vec.fit_transform([doc])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    norm_count_occur_df = pd.DataFrame({
        'Word': norm_count_vec.get_feature_names_out(),
        'Count': norm_count_occurs.toarray()[0],
    })
    norm_count_occur_df.sort_values('Count', ascending=False, inplace=True)
    norm_count_occur_df.head()
Output
    Word: "of", Occurrence: 0.4286
    Word: "bow", Occurrence: 0.2857
    Word: "way", Occurrence: 0.1429
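The arithmetic behind L2 normalization can be checked by hand. The sketch below assumes the default tokenizer, under which "of" and "the" occur 3 times each in the example document, "bow" occurs twice, and the remaining 27 distinct words occur once each. Every count is divided by the length (L2 norm) of the whole count vector.

```python
import math

# Counts of the repeated words in the example document
counts = {"of": 3, "the": 3, "bow": 2}
# Number of distinct words that occur exactly once (assumed, see lead-in)
single_occurrence_words = 27

# L2 norm: square root of the sum of squared counts
l2_norm = math.sqrt(sum(c * c for c in counts.values())
                    + single_occurrence_words * 1 * 1)

normalized = {word: c / l2_norm for word, c in counts.items()}
normalized["way"] = 1 / l2_norm  # any word that occurs once

print(l2_norm)  # 7.0
for word in ("of", "bow", "way"):
    print(word, round(normalized[word], 4))
```

Because the sum of squares is 9 + 9 + 4 + 27 = 49, the norm is exactly 7, so a word that occurs 3 times gets 3/7, one that occurs twice gets 2/7, and one that occurs once gets 1/7.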
Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF takes another approach: it assumes that high frequency alone may not provide much information gain. In other words, rare words contribute more weight to the model.
The importance of a word increases with the number of times it occurs within the same document (i.e. training record). On the other hand, it decreases the more the word occurs across the corpus (i.e. other training records).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    doc = ("In the-state-of-art of the NLP field, Embedding is the "
           "success way to resolve text related problem and outperform "
           "Bag of Words ( BoW ). Indeed, BoW introduced limitations "
           "large feature dimension, sparse representation etc.")

    # Default settings: use_idf=True and norm='l2'
    tfidf_vec = TfidfVectorizer()
    tfidf_count_occurs = tfidf_vec.fit_transform([doc])

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    tfidf_count_occur_df = pd.DataFrame({
        'Word': tfidf_vec.get_feature_names_out(),
        'Count': tfidf_count_occurs.toarray()[0],
    })
    tfidf_count_occur_df.sort_values('Count', ascending=False, inplace=True)
    tfidf_count_occur_df.head()
Output (the values are exactly the same as for normalized count occurrence because the demo code includes only one document)
    Word: "of", Occurrence: 0.4286
    Word: "bow", Occurrence: 0.2857
    Word: "way", Occurrence: 0.1429
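With a single document, the IDF part has no effect, since every word appears in every document. The sketch below uses three made-up documents to show how a word that appears everywhere is down-weighted relative to a rare one. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the one scikit-learn documents for its default settings; the helper name smoothed_idf is hypothetical.

```python
import math

# Hypothetical mini corpus, for illustration only
docs = [
    "fresh apple juice",
    "fresh orange juice",
    "fresh milk",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    # df: number of documents that contain the term at least once
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ("fresh", "juice", "milk"):
    print(term, round(smoothed_idf(term), 4))
```

Here "fresh" appears in all three documents and gets the lowest weight, while "milk" appears in only one and gets the highest.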
Code
This sample code compares Count Occurrence, Normalized Count Occurrence and TF-IDF.
First, a sample function that returns a model for the chosen vectorization method:
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    def build_model(mode):
        # Default parameters are used on purpose, for showcase only
        if mode == 'count':
            vect = CountVectorizer()
        elif mode == 'tf':
            vect = TfidfVectorizer(use_idf=False, norm='l2')
        elif mode == 'tfidf':
            vect = TfidfVectorizer()
        else:
            raise ValueError('Mode should be one of count, tf or tfidf')

        # Chain the vectorizer and the classifier into one estimator
        return Pipeline([
            ('vect', vect),
            ('clf', LogisticRegression(solver='newton-cg', n_jobs=-1)),
        ])
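Next, another sample function that builds an end-to-end pipeline:
    from sklearn.model_selection import KFold, cross_val_score

    def pipeline(x, y, mode):
        # preprocess_x is the author's own preprocessing helper (not shown in this article)
        processed_x = preprocess_x(x)
        model_pipeline = build_model(mode)

        # 10-fold cross-validation, reporting mean accuracy and 2 standard deviations
        cv = KFold(n_splits=10, shuffle=True)
        scores = cross_val_score(
            model_pipeline, processed_x, y, cv=cv, scoring='accuracy')
        print("Accuracy: %0.4f (+/- %0.4f)" % (
            scores.mean(), scores.std() * 2))

        return model_pipeline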
Let's check the size of the vocabulary we need to handle
    x = preprocess_x(x_train)
    y = y_train
    model_pipeline = build_model(mode='count')
    model_pipeline.fit(x, y)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print('Number of Vocabulary: %d' % len(
        model_pipeline.named_steps['vect'].get_feature_names_out()))
Output
Number of Vocabulary: 130107
Invoke the pipeline by passing “count” (Count Occurrence), “tf” (Normalized Count Occurrence) and “tfidf” (TF-IDF):
    print('Using Count Vectorizer------')
    model_pipeline = pipeline(x_train, y_train, mode='count')

    print('Using TF Vectorizer------')
    model_pipeline = pipeline(x_train, y_train, mode='tf')

    print('Using TF-IDF Vectorizer------')
    model_pipeline = pipeline(x_train, y_train, mode='tfidf')
The result is
    Using Count Vectorizer------
    Accuracy: 0.8892 (+/- 0.0198)
    Using TF Vectorizer------
    Accuracy: 0.8071 (+/- 0.0110)
    Using TF-IDF Vectorizer------
    Accuracy: 0.8917 (+/- 0.0072)
Conclusion
You can find all the code on GitHub.
In a previous project, I tackled the problem of classifying a product's category from a short description. For example, given “Fresh Apple”, the expected category is “Fruit”. I was already able to reach over 80% accuracy using the count occurrence approach alone.
In this case, the number of words per training record is very small (from 2 to 10 words), so it may not be a good idea to use Word Embeddings, as there are not many neighbors (words) for training the vectors.
On the other hand, scikit-learn provides other parameters to further tune the model input. You may want to take a look at the following features:
- ngram_range: Rather than using single words only, n-grams can be used as well.
- binary: Besides counting occurrences, a binary representation can be chosen.
- max_features: Instead of using all words, a maximum number of words can be chosen to reduce the model complexity and size.
Also, some preprocessing steps can be executed within the library rather than handled by yourself, for example stop word removal and lowercasing. For better flexibility, I use my own code for the preprocessing steps.
About Me
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics.
Visit my blog at http://medium.com/@makcedward/
Connect with me at https://www.linkedin.com/in/edwardma1026
Explore my code at https://github.com/makcedward
Check my kernels at https://www.kaggle.com/makcedward