
BERT Tokenizers NuGet Package for C#

source link: https://rubikscode.net/2021/11/01/bert-tokenizers-for-ml-net/

Nov 1, 2021 | .NET, AI, C#

In the previous article, I wrote about how you can use Huggingface transformers with ML.NET. The process is fairly simple: you convert the Huggingface model to the ONNX file format and load it with ML.NET.

However, while working with BERT Models from Huggingface in combination with ML.NET, I stumbled upon several challenges. The problem was not in loading the model and using it, but in preparing the data and building all the other helpers for that process.

The biggest challenge by far was that I needed to implement my own tokenizer and pair it with the correct vocabulary. So, I decided to extend my implementation to cover other cases and publish it as an open-source project. If you find it useful, feel free to contribute. In this article, I will focus on the supported models and use cases.

This bundle of e-books is specially crafted for beginners. Everything from Python basics to the deployment of Machine Learning algorithms to production in one place. Become a Machine Learning Superhero TODAY!

In this article we cover:

1. Motivation for this project

2. BERT Tokenizer NuGet Package

3. Use Case Example

1. Motivation for this project

If we were using a Huggingface model in Python, we could load both the tokenizer and the model from Huggingface like this:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')

And then use it like this:

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Calculate embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

As I mentioned, I wanted to use BERT models from Huggingface within ML.NET. However, ML.NET offers no such convenient options. Thanks to the available tools, it was easy to export Huggingface models into ONNX files and import them into ML.NET. The real problem is tokenization, since no tokenizer is available in ML.NET.

BERT models require specifically structured data. The worst part is that different models use different vocabularies for tokenization. For example, BERT for the German language will not understand the same tokens as the BERT Multilingual model. This means that a tokenizer implementation for one model will not work for another.

ONNX Model

On top of that, some Huggingface BERT models use cased vocabularies, while others use uncased vocabularies. There is a lot of room for mistakes and too little flexibility for experiments. For example, let's analyze the BERT Base model from Huggingface.

Its "official" name is bert-base-cased. The name indicates that it uses a cased vocabulary, i.e. the model distinguishes between lowercase and uppercase letters. Its inputs and outputs are:

Bert Base Huggingface Model

Let’s focus on the inputs. We need to provide tokens (input_ids), along with an attention mask (attention_mask) and segment indexes (token_type_ids).
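To make the shape of these three inputs concrete, here is a minimal sketch of how one sentence becomes three fixed-length, parallel arrays. The word IDs are made up for illustration, and the special-token values are assumptions; the real IDs come from the model's vocabulary:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class InputSketch
{
    const int SeqLen = 8;                     // fixed length (32 for the article's model)
    const long Cls = 101, Sep = 102, Pad = 0; // assumed special-token IDs

    // Wrap word IDs in [CLS]/[SEP] markers, then pad everything to SeqLen.
    static (long[] Ids, long[] Mask, long[] Types) Encode(long[] wordIds)
    {
        var ids = new List<long> { Cls };
        ids.AddRange(wordIds);
        ids.Add(Sep);

        var mask = Enumerable.Repeat(1L, ids.Count).ToList(); // 1 = real token
        while (ids.Count < SeqLen) { ids.Add(Pad); mask.Add(0L); } // 0 = padding

        var types = new long[SeqLen]; // all zeros for a single-sentence input
        return (ids.ToArray(), mask.ToArray(), types);
    }

    static void Main()
    {
        var (ids, mask, _) = Encode(new long[] { 7592, 2088 });
        Console.WriteLine(string.Join(", ", ids));  // 101, 7592, 2088, 102, 0, 0, 0, 0
        Console.WriteLine(string.Join(", ", mask)); // 1, 1, 1, 1, 0, 0, 0, 0
    }
}
```

The attention mask tells the model which positions carry real tokens and which are padding, while the token type (segment) indexes only matter for two-sentence inputs.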

In ML.NET, we first need to build classes that handle the input and output of the model, and use the ApplyOnnxModel method to load it. You can find out more about this in the previous article:

public class ModelInput
{
    [VectorType(1, 32)]
    [ColumnName("input_ids")]
    public long[] InputIds { get; set; }

    [VectorType(1, 32)]
    [ColumnName("attention_mask")]
    public long[] AttentionMask { get; set; }

    [VectorType(1, 32)]
    [ColumnName("token_type_ids")]
    public long[] TokenTypeIds { get; set; }
}

public class ModelOutput
{
    [VectorType(1, 32, 768)]
    [ColumnName("last_hidden_state")]
    public float[] LastHiddenState { get; set; }

    [VectorType(1, 768)]
    [ColumnName("pooler_output")]
    public float[] PoolerOutput { get; set; }
}
var pipeline = _mlContext.Transforms
    .ApplyOnnxModel(modelFile: bertModelPath,
                    shapeDictionary: new Dictionary<string, int[]>
                    {
                        { "input_ids", new [] { 1, 32 } },
                        { "attention_mask", new [] { 1, 32 } },
                        { "token_type_ids", new [] { 1, 32 } },
                        { "last_hidden_state", new [] { 1, 32, 768 } },
                        { "pooler_output", new [] { 1, 768 } },
                    },
                    inputColumnNames: new[] { "input_ids",
                                              "attention_mask",
                                              "token_type_ids" },
                    outputColumnNames: new[] { "last_hidden_state",
                                               "pooler_output" },
                    gpuDeviceId: useGpu ? 0 : (int?)null,
                    fallbackToCpu: true);

Still, we don’t have an easy way to create tokens and provide them as input to this model. It was a challenge to create a correct ModelInput object, and just as much of a challenge to make sense of the model's output. Now we can use the BERTTokenizers NuGet package for this purpose.

2. BERT Tokenizers NuGet Package

This NuGet package should make your life easier. The goal is to get as close as possible to the ease of use of the Python API. The complete stack provided by Huggingface's Python API is very user-friendly, and it paved the way for many people to use SOTA NLP models in a straightforward way. Hopefully, one day we will be able to do the same in C# as well.

BERTTokenizers NuGet Package

To install the BERTTokenizers NuGet package use this command:

dotnet add package BERTTokenizers

Or you can install it with Package Manager:

Install-Package BERTTokenizers

2.1 Supported Models and Vocabularies

At the moment, BERTTokenizers supports the following vocabularies:

  • BERT Base Cased – class BertBaseTokenizer
  • BERT Large Cased – class BertLargeTokenizer
  • BERT German Cased – class BertGermanTokenizer
  • BERT Multilingual Cased – class BertMultilingualTokenizer
  • BERT Base Uncased – class BertBaseUncasedTokenizer
  • BERT Large Uncased – class BertLargeUncasedTokenizer

With this, a wide range of models is supported: not just BERT models, but DistilBERT models as well.

2.2 Available methods

Tokenizer UML

Every class provides the same three functions:

  • Encode – the inputs to this method are the sequence length (mandatory, because ML.NET doesn’t support input of variable length) and the sentence that needs to be encoded. As output, it provides a list of tuples with Token ID, Token Type and Attention Mask for each token in the encoded sentence.
  • Tokenize – in case you need more information about the tokens, or you want to perform padding on your own, this method will do the trick. The input is the sentence that needs to be tokenized, and the output is a list of tuples with Token, Token ID and Token Type for each token in the sentence.
  • Untokenize – this method reverses the process and turns a list of tokens back into meaningful words. It is used on the output of the model.
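As a sketch, the three methods can be used together like this. The exact return types and tuple member names (e.g. `Token`) are assumptions based on the descriptions above, not the package's documented API:

```csharp
using System.Linq;
using BERTTokenizers;

var tokenizer = new BertBaseTokenizer();

// Encode: fixed-length output (32 here), ready to feed into the model
var encoded = tokenizer.Encode(32, "ML.NET is awesome.");

// Tokenize: per-token detail (Token, Token ID, Token Type), no padding
var tokens = tokenizer.Tokenize("ML.NET is awesome.");

// Untokenize: rebuild readable words from tokens, e.g. from model output
// (assumes each tuple exposes a `Token` member holding the token string)
var words = tokenizer.Untokenize(tokens.Select(t => t.Token).ToList());
```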

The UML of the project looks something like this:

BERTTokenizers Project UML

3. Use Case Example

Let’s go back to the BERT Base model. For it, we built our input class like this:

public class ModelInput
{
    [VectorType(1, 32)]
    [ColumnName("input_ids")]
    public long[] InputIds { get; set; }

    [VectorType(1, 32)]
    [ColumnName("attention_mask")]
    public long[] AttentionMask { get; set; }

    [VectorType(1, 32)]
    [ColumnName("token_type_ids")]
    public long[] TokenTypeIds { get; set; }
}
Sentiment Analysis Visual

Before we create an object that we will send into the pipeline, we need to create tokens and we can do it like this:

var tokenizer = new BertBaseTokenizer();

var encoded = tokenizer.Encode(32, sentence);

var bertInput = new ModelInput()
                {
                    InputIds = encoded.InputIds,
                    AttentionMask = encoded.AttentionMask,
                    TokenTypeIds = encoded.TokenTypeIds,
                };

Important note: the first parameter of the Encode method must match the sequence size declared in the VectorType attributes of the ModelInput class.
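Putting it all together, the encoded input can be run through the ONNX pipeline built earlier. This wiring is a sketch using standard ML.NET APIs, with `_mlContext`, `pipeline` and `bertInput` carried over from the snippets above:

```csharp
using System.Collections.Generic;
using Microsoft.ML;

// Fit the pipeline on an empty list just to materialize the transformer
var emptyData = _mlContext.Data.LoadFromEnumerable(new List<ModelInput>());
var model = pipeline.Fit(emptyData);

// Create a prediction engine and run the tokenized sentence through BERT
var engine = _mlContext.Model
    .CreatePredictionEngine<ModelInput, ModelOutput>(model);

ModelOutput output = engine.Predict(bertInput);
// output.LastHiddenState holds one 768-dimensional vector per input token
```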

Conclusion

In this article, we saw how to use the BERTTokenizers NuGet package to easily build input tokens for BERT.

Thanks for reading!


