
Combining numerical and text features in deep neural networks



Photo by Marius Masalar on Unsplash

In applied machine learning, data often consists of multiple data types, e.g. text and numerical data. To build a model that combines features from both domains, it is necessary to stack these features together. This post shows different ways to combine natural language processing and traditional features in a single end-to-end model in Keras.

Real-world data is different

Scientific data sets are usually limited to a single kind of data, e.g. text, images, or numerical data. This makes sense, as the goal is to compare new models and approaches with existing ones. In real-world scenarios, however, data is often more diverse. To use end-to-end learning with neural networks, instead of manually stacking separate models, we need to combine these different feature spaces inside the neural network.

Let's assume we want to solve a text classification problem and we have additional metadata for each of the documents in our corpus. In simple approaches, where a document is represented by a bag-of-words vector, we could just add our metadata to the vector as additional words and be done. But when using a modern approach like word embeddings, it's a bit more complicated.
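As a minimal sketch of the bag-of-words case (scikit-learn's CountVectorizer and the two metadata columns are illustrative choices, not part of the original article), appending the metadata is just a matter of stacking extra columns onto the document vectors:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs chase cats"]
# Illustrative metadata (not from the article), e.g. document length and an author flag
meta = np.array([[6, 1],
                 [3, 0]])

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()   # shape: (n_docs, vocab_size)

# Appending the metadata columns treats them like extra "words" in the vector
features = np.hstack([bow, meta])                # shape: (n_docs, vocab_size + n_meta)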

Special Tokens

The simple solution is to add our metadata as additional special embeddings. Similar to special tokens in language models like BERT, these embeddings are tokens that can occur like words. They are binary, so we don't have a continuous value space: we need to transform our data into categorical features by binning or one-hot encoding. After we have determined how many additional features we need, we can expand the vocabulary size by the number of additional features and treat them as additional words.

Example: Our dictionary has 100 words and we have 10 additional features.

This illustration shows how every sequence now begins with the features encoded as special embeddings

The sequence of embeddings now always starts with the metadata features (special tokens), so we must increase our sequence length by 10. Each of these 10 special embeddings represents one of the added features.
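A minimal sketch of this encoding, under the assumption that the 10 extra features simply occupy the token IDs 100 to 109 (the concrete IDs and the example document are illustrative, not from the article):

import numpy as np

vocab_size = 100   # original dictionary size
n_meta = 10        # additional (binned or one-hot encoded) features

# The extra features occupy the IDs 100..109, so the vocabulary grows to 110
special_ids = np.arange(vocab_size, vocab_size + n_meta)

doc_ids = np.array([12, 57, 3, 99])   # a tokenized document (word IDs < 100)

# Prepend the special token IDs, so the sequence length grows by 10
sequence = np.concatenate([special_ids, doc_ids])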

There are several drawbacks to this solution. We only have categorical features, not continuous values, and even more importantly, our embedding space mixes up text and metadata.

Multiple input models

To build a model that can handle continuous data and text data without these limiting factors, we take a look at the internal representation of the data inside the model. At some point, every neural network has an internal representation of its input. Typically, this representation sits just before the last (fully connected) layer of the network. For recurrent networks in NLP (e.g. LSTMs), this representation is a document embedding. By expanding this representation with our additional features, we overcome those limitations.

What happens in such a model is that we basically stack two models on top of each other, while preserving the ability to train them simultaneously on the same target label. That is why it is called an end-to-end model.

Example:

In Keras this is possible with multiple input models. Again we have 100 words and 10 additional features.

from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, concatenate
from tensorflow.keras.models import Model

seq_length = 100      # length of the padded word-ID sequences (example value)
embedding_size = 64   # dimensionality of the word embeddings (example value)

nlp_input = Input(shape=(seq_length,))
meta_input = Input(shape=(10,))
emb = Embedding(output_dim=embedding_size, input_dim=100, input_length=seq_length)(nlp_input)
nlp_out = Bidirectional(LSTM(128))(emb)
concat = concatenate([nlp_out, meta_input])
classifier = Dense(32, activation='relu')(concat)
output = Dense(1, activation='sigmoid')(classifier)
model = Model(inputs=[nlp_input, meta_input], outputs=[output])

We use a bidirectional LSTM and combine its output with the metadata. To do so, we define two input layers and treat them in separate "data paths" (nlp_input and meta_input). Our NLP data goes through the embedding transformation and the LSTM layer. The metadata just goes through some normalization, so we can concatenate it directly with the LSTM output (nlp_out). This combined vector is now the full representation of our input and can finally be classified in a fully connected layer.
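A hedged usage sketch for training this model (the random arrays stand in for real padded sequences, normalized metadata, and labels; they are not from the article):

import numpy as np

# Illustrative placeholder data: random word IDs, min-max scaled metadata, binary labels
n_samples = 256
padded_sequences = np.random.randint(0, 100, size=(n_samples, seq_length))
scaled_meta = np.random.rand(n_samples, 10)
labels = np.random.randint(0, 2, size=(n_samples, 1))

# The two inputs are passed together as a list, matching the order of the Model definition
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit([padded_sequences, scaled_meta], labels, epochs=2, batch_size=32)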


The architecture of a simple multiple input model

This concept works in any other domain where sequence data from RNNs is mixed with non-sequence data. Going even further, it is possible to combine images, text, and sequences into one single model.
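As a rough sketch of that idea (the small convolutional branch and its sizes are my own illustrative choices, not from the article), an image input could be added as a third branch and concatenated into the same representation:

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten

# Illustrative third branch for 64x64 RGB images (sizes are arbitrary example values)
img_input = Input(shape=(64, 64, 3))
x = Conv2D(16, (3, 3), activation='relu')(img_input)
x = MaxPooling2D((2, 2))(x)
img_out = Flatten()(x)

# Concatenate all three representations and classify as before
combined = concatenate([nlp_out, meta_input, img_out])
hidden = Dense(32, activation='relu')(combined)
multi_output = Dense(1, activation='sigmoid')(hidden)
multi_model = Model(inputs=[nlp_input, meta_input, img_input], outputs=[multi_output])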

Originally published at http://digital-thinking.de

