
Understanding text classification with machine learning

source link: https://www.neuraldesigner.com/blog/text-classification

By David González, Artelnics.

Text classification is a machine learning technique that assigns predefined categories to open text. Text classifiers can be used to organize, structure and categorize virtually any type of text: from documents, medical studies and archives to the entire web.

This post will explain this Text Mining technique, its stages and procedures.

Neural Designer contains this machine learning technique so that you can apply it in practice. You can download a free trial here.

Introduction

The categorization or classification of information is one of the most widely used branches of Text Mining. The premise of classification techniques is simple: starting from a set of documents with an assigned category or label, the objective is to build a system that identifies the patterns in those documents that determine their class. During the construction of the artificial intelligence model, the aim is to minimize the error between the categories predicted by the system and the real ones previously assigned.

Some examples of applications for text classification are spam detection, sentiment analysis, hate speech detection, and fake news detection.

Depending on the number of document classes, text classification problems can be:

  1. Binary: here, the model must identify whether a document belongs to a class or not. An example of this problem is spam detection, where we have to identify whether an email is spam or not.
  2. Multiple: in these problems, each document must be assigned one class from a list of several. An example is sentiment analysis, where emotions (happiness, sadness, joy, …) are the classes.

It is essential to highlight the difficulty of the data labelling process, which is generally carried out manually by experts in the field of application of the model to be generated.

This image summarizes the text classification training process.

We can divide the text classification process into the following steps:

  1. Data processing and transformation
  2. Model training
  3. Testing analysis

1. Data processing and transformation

The transformation process in a classification problem comprises two stages: normalization and numerical representation.

Normalization

In classification problems, the computational cost can be very high, and reducing the number of input variables helps obtain better results faster. For this purpose, document normalization is generally applied. This process applies some of the following techniques to reduce the number of input words:

  1. Lowercase transformation: for example, “LoWerCaSE” is transformed into “lowercase”.
  2. Punctuation signs and special characters removal: punctuation signs and special characters like “;”,”#”, or “=” are removed.
  3. Stop words elimination: Stop words are a set of commonly used words in any language that don’t provide any information for our model. For example, some stop words in the English language are “myself”, “can”, and “under”.
  4. Short and long words deletion: short words are eliminated because they do not provide much information, for example, the word "he". On the other hand, long words are eliminated because of their low frequency in the documents.
  5. Stemming: every word is composed of a root or lemma (the part of the word that does not vary and carries its central meaning) and morphemes (particles added to the root to form new words). The stemming technique replaces each word with its root to obtain a smaller number of distinct input words.
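As a rough sketch, the normalization steps above could be chained as follows. The stop-word list and the suffix-stripping stemmer here are deliberately tiny, illustrative stand-ins; a real system would use a full stop-word list and a proper stemmer such as Porter's.

```python
import re

# Tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {"i", "the", "a", "an", "is", "are", "myself", "can", "under"}

def normalize(document, min_len=3, max_len=15):
    """Lowercase, strip punctuation, drop stop words and very short
    or very long words, and apply a naive suffix-stripping stem."""
    text = document.lower()                           # 1. lowercase transformation
    text = re.sub(r"[^a-z\s]", " ", text)             # 2. remove punctuation/special chars
    words = []
    for word in text.split():
        if word in STOP_WORDS:                        # 3. stop-word elimination
            continue
        if not (min_len <= len(word) <= max_len):     # 4. short/long word deletion
            continue
        for suffix in ("ing", "ed", "es", "s"):       # 5. naive stemming
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[:-len(suffix)]
                break
        words.append(word)
    return words
```

For instance, `normalize("The CLASSIFIERS are classifying texts!")` reduces three surface forms to the related stems `classifier`, `classify`, and `text`, shrinking the eventual vocabulary.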

Once we have processed and normalized the documents, we need to transform them into a numerical format that neural networks can process.

Numerical representation

The intuition behind this idea is based on representing documents as vectors in an n-dimensional vector space. These vectors can be interpreted (and used) by the neural network to perform different tasks. One of the simplest traditional text representation techniques is Bag of Words.

Bag of Words

Bag-of-Words (BoW) consists of constructing a dictionary for the working dataset and representing each document as a count of the words in it. In this representation, each document becomes a vector whose length equals the number of words in the dictionary, and each vector element is the number of times the corresponding word appears in the document.

| | water | steak | want | don’t | the | some | and | I |
|---|---|---|---|---|---|---|---|---|
| “I want some water.” | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| “I want the steak.” | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| “I want steak, and I want water.” | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 2 |
| “I don’t want water.” | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |

This method is called bag of words since the order of the words is not represented. It is also important to note that if new documents were introduced whose words are not present in the vocabulary, they could be transformed by omitting the unknown words.
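Assuming the documents have already been normalized into space-separated words, the bag-of-words construction can be sketched as:

```python
from collections import Counter

def bag_of_words(documents):
    """Build the dictionary from the working dataset and return one
    count vector per document. Words of a new document that are not
    in the vocabulary would simply be omitted."""
    # The dictionary: every distinct word in the corpus, in a fixed order.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # One element per dictionary word; word order in the document is lost.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors
```

For example, `bag_of_words(["i want some water", "i want the steak"])` yields the six-word vocabulary `["i", "some", "steak", "the", "want", "water"]` and the vectors `[1, 1, 0, 0, 1, 1]` and `[1, 0, 1, 1, 1, 0]`.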

The BoW model has several drawbacks. One of the most relevant is that when the corpus is large, the vocabulary grows accordingly. The result is a set of very large, very sparse vectors full of zeros, which implies higher memory consumption.

2. Model training

Once the document's numerical representation has been obtained, we can start model training using a classification neural network. A classification neural network usually requires a scaling layer, one or several perceptron layers, and a probabilistic layer.
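As an illustrative stand-in for such a network (the scaling layer is omitted and a single sigmoid unit replaces the perceptron and probabilistic layers; this is a minimal sketch, not Neural Designer's actual implementation), a binary classifier trained by gradient descent on BoW vectors might look like:

```python
import math

def train_classifier(X, y, learning_rate=0.5, epochs=300):
    """Train a single sigmoid unit with stochastic gradient descent,
    minimizing the cross-entropy between predicted and real classes."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, features))
            prob = 1.0 / (1.0 + math.exp(-z))     # probabilistic output in (0, 1)
            error = prob - target                 # gradient of the cross-entropy loss
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, features):
    """Assign class 1 when the predicted probability exceeds 0.5."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if z > 0.0 else 0
```

On a tiny linearly separable dataset of BoW vectors (e.g. spam vs. non-spam), a few hundred epochs are enough for the unit to classify the training documents correctly.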

3. Testing analysis

As with any classification problem, model evaluation is essential. However, in text classification problems, evaluation measures are not absolute, as they depend on the specific classification task: classifying medical texts is not the same as classifying whether a review is positive or negative. Therefore, the usual practice is to look in the literature for baselines on similar tasks and compare against them to check whether we are getting acceptable results.

As with a traditional classification task, the most commonly used metrics are:

  1. Confusion matrix: in the confusion matrix, the rows represent the target classes in the data set and the columns the predicted output classes from the neural network. The confusion matrix can be represented as follows:

     | | Predicted class 1 | ... | Predicted class N |
     |---|---|---|---|
     | Real class 1 | # | # | # |
     | ... | # | # | # |
     | Real class N | # | # | # |
  2. Precision: the proportion of documents predicted as class c that actually belong to class c.
    $$ precision = \frac{\#\ true\ positives}{\#\ true\ positives + \#\ false\ positives}$$
  3. Recall: the proportion of documents of class c in the data set that the model correctly classifies as class c.
    $$ recall = \frac{\#\ true\ positives}{\#\ true\ positives + \#\ false\ negatives}$$
  4. F1-Score: generally, a good classifier should balance precision and recall. For this purpose, we use the F1-score metric, which considers both parameters and penalizes the total value if either is too low.
    $$ F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$$
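The three scalar metrics above can be computed directly from parallel lists of real and predicted labels; a minimal sketch:

```python
def precision_recall_f1(targets, predictions, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(1 for t, p in zip(targets, predictions) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(targets, predictions) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(targets, predictions) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: low precision or low recall drags F1 down.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, with real labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 0, 1]` there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3.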

In addition, we must navigate between overfitting and underfitting to arrive at a quality classifier. An under-fitted model has low variance, which means that whenever the same data is introduced, the same prediction is obtained, but its predictions are too far from reality; this phenomenon occurs when the model cannot find the existing patterns in the data. On the other hand, an over-fitted model learns the training data too closely, noise included, and fails to generalize to unseen documents. The way to achieve the optimal working point is to evaluate the model with data on which it has not been trained and which it has never seen. For this reason, it is advisable to subdivide the corpus into multiple subsets (training, testing, and selection).
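A simple corpus subdivision might look like the following sketch (the 60/20/20 proportions and the fixed seed are illustrative assumptions, not a prescription):

```python
import random

def split_corpus(documents, train=0.6, selection=0.2, seed=0):
    """Shuffle the corpus and split it into training, selection,
    and testing subsets (60/20/20 by default)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)   # fixed seed for reproducibility
    n_train = int(len(docs) * train)
    n_sel = int(len(docs) * selection)
    return (docs[:n_train],                      # used to fit the model
            docs[n_train:n_train + n_sel],       # used to choose among models
            docs[n_train + n_sel:])              # held out for the final test
```

Only the selection subset is consulted while tuning, and the testing subset is touched once, at the end, to estimate real-world performance.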

Conclusions

Text classification is one of the most widely used Text Mining techniques today, with applications ranging from classifying reviews into positive and negative, to sorting support messages by urgency.

In this article, we have seen the different stages of text classification problems: data processing and transformation, model training, and testing analysis. Following these steps, we can build high-quality text classification models.

© 2022, Artificial Intelligence Techniques, Ltd.
