Source: https://treatwell.engineering/beyond-the-names-leveraging-ai-for-gender-inference-and-name-parsing-in-a-customer-base-a16f2a6f51a0

Beyond the names: Leveraging AI for gender inference and name parsing in a multilingual customer base

In today’s digital era, customer data management is more than just storing information. It requires sophisticated tools and strategies to process data, improve its quality and derive new insights. One such crucial task is accurately inferring a customer’s gender and parsing their name, since doing so can greatly impact the customer experience, improve the organization and matching of data from different sources, and ensure consistency in data analysis.

In a multicultural context like Treatwell, the leading online booking platform for hair and beauty, this challenge becomes even more intricate due to the diversity in naming conventions and frequent inconsistencies in the data.

This blog post explores how we address these issues by harnessing the power of Artificial Intelligence, specifically Deep Learning techniques, to predict a customer’s gender and separate first and last names even in the messiest data scenarios. We’ll discuss the challenges that arise, including multilingual complexity, mixed-up names, and unisex names, and how AI helps solve them, enhancing data accuracy and, consequently, the customer experience.

Artificial Intelligence distinguishing between males and females (DALL-E)

🌈 Disclaimer: While this work has focused on predicting customers’ gender based on their name, it is important to highlight that this does not in any way minimize or dismiss the reality and importance of gender fluidity or non-binary identities. Our goal here is not to constrain or binarize gender, but rather to contribute to the complexity of a task that remains lightly explored in literature. Our models and predictions are a reflection of existing databases and societal conventions, and we acknowledge that there is room for considerable expansion and refinement to better capture the diversity and complexity of human identities.

1. Why do we need such a tool?

Customer data is often messy, with inconsistencies and missing information, including swapped first/last names and missing gender labels. This is where a tool like this becomes crucial, as it serves as a powerful solution to these issues and offers a multitude of benefits:

  1. Data Normalization to handle duplicated records: Record linkage is a technique that helps match and merge records from disparate sources to create a comprehensive and unified customer database. In this context, the first name and last name are fundamental features used for linkage and duplicate matching. However, it is not uncommon to encounter rows in which a person’s first and last name are switched, leading to potential mismatches in the linkage process.
  2. Enhancing customer experience: By accurately identifying gender and distinguishing between first and last names, Treatwell can offer tailor-made recommendations for treatments and services.
  3. Consistency: It is important to ensure consistency in customer data by standardizing the format and structure of names and identifying gender. This can reduce errors caused by inconsistent data entry and improve the accuracy of data analysis.

2. Challenges out there

This kind of task presents several challenges due to the multicultural and multilingual context in which Treatwell operates, which makes it more complex and nuanced.

Mixed up first name and last names

In the customer database, it is not unusual to find rows in which a person’s first and last name are switched, as in the third row of the following table:

[Image: sample customer records, with the first and last name swapped in the third row]

This occurrence is a result of the data collection process, which involves human data entry and, inevitably, human errors.

In the customer database, which serves as the initial dataset for training the model, the presence of swapped names and surnames introduces intrinsic noise. This noise propagates into the model we intend to develop for recognizing first names and last names in a given string. The initial dataset comprises 500 million records containing the names, surnames, and gender of customers across all 13 countries in which Treatwell operates.

Solution: To ensure the accuracy and effectiveness of our model and to minimize false negatives during the record linkage process, we implement a comprehensive preprocessing approach tailored to each country’s dataset. Several steps are involved:

  1. Normalization — Converting all names and surnames to lowercase to ensure consistency and avoid case-related discrepancies.
  2. Outlier removal — Given that the records are entered manually, the dataset may contain various outliers, such as typos or entries that refer to entities other than physical persons, like firms, shops, or fictional characters (e.g., “Mickey Mouse” or “Pizzeria Bella Napoli”). To handle this, we perform outlier removal by excluding rows with first and last names that occur with low frequency, retaining only the 90% most frequently occurring names and surnames. This helps us filter out irrelevant or erroneous data that could negatively impact our model’s performance.
  3. Cleaning — To identify and correct potential instances where first names and last names are inverted, we leverage the concept of conditional probabilities [1], which enables us to assess the likelihood of specific name-surname combinations given a particular gender. When we encounter a row in the dataset with a first name x and last name y where P(name=x, surname=y | gender) is lower than P(name=y, surname=x | gender), it is likely that the first and last names are mistakenly inverted. As a result, approximately 20M rows were identified as likely having the first names and last names swapped and we confidently inverted them.
  4. Deduplication — To avoid overrepresentation of common names, we only keep one instance of duplicate records during preprocessing. This ensures that no bias is introduced towards the most frequently occurring names in the dataset.
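A minimal pandas sketch of the normalization, cleaning, and deduplication steps (the column names, toy records, and probability floor are our assumptions for illustration; outlier removal is omitted for brevity):

```python
import pandas as pd

# Toy dataset; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "first_name": ["Mario", "Mario", "Rossi", "Maria", "Maria"],
    "last_name":  ["Rossi", "Bianchi", "Mario", "Verdi", "Verdi"],
    "gender":     ["M", "M", "M", "F", "F"],
})

# Step 1 - Normalization: lowercase all names.
df["first_name"] = df["first_name"].str.lower()
df["last_name"] = df["last_name"].str.lower()

# Step 3 - Cleaning: estimate P(name | gender) and P(surname | gender) from
# frequency counts, then swap rows where the reversed combination is more likely.
p_first = df.groupby("gender")["first_name"].value_counts(normalize=True).to_dict()
p_last = df.groupby("gender")["last_name"].value_counts(normalize=True).to_dict()

EPS = 1e-9  # floor for (gender, token) pairs never seen in that position

def swap_likely(row):
    as_is = (p_first.get((row["gender"], row["first_name"]), EPS)
             * p_last.get((row["gender"], row["last_name"]), EPS))
    swapped = (p_first.get((row["gender"], row["last_name"]), EPS)
               * p_last.get((row["gender"], row["first_name"]), EPS))
    return swapped > as_is

mask = df.apply(swap_likely, axis=1)
df.loc[mask, ["first_name", "last_name"]] = df.loc[mask, ["last_name", "first_name"]].values

# Step 4 - Deduplication: keep one instance of each (first, last, gender) triple.
df = df.drop_duplicates(subset=["first_name", "last_name", "gender"]).reset_index(drop=True)
# The "rossi / mario" row is flipped to "mario / rossi" and then deduplicated.
```

In the toy data, “mario” appears more often as a first name than as a last name, so the reversed row is detected and corrected before deduplication.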

Statistical normalization of persons’ first and last name

After the preprocessing stage, the final cleaned dataset, ready for the subsequent steps, consisted of approximately 200M records.

Multilingual Complexity

Treatwell serves customers from diverse linguistic backgrounds, and this diversity is reflected in the names present in the customer database. Different languages have distinct naming conventions and structures, making it difficult to apply a one-size-fits-all approach. Furthermore, we come across names that include the Greek alphabet (e.g., ζωή) or specific characters common in Portuguese (e.g., João) and German (e.g., Müller).

Solution: To address the inherent variability, we employ a token-free approach that doesn’t rely on a predefined vocabulary to map words or subword units to tokens. Instead, we utilize a character-based encoding method that operates directly on UTF-8 bytes. This allows us to effectively represent each input string, regardless of the language or unique characters it contains.

Moreover, a character-based approach serves an additional purpose: aiding in gender inference. Certain characters or combinations of them can effectively convey a person’s gender. Indeed it is well-established that character n-grams, especially the last two characters of names, exhibit correlations with gender. For instance, in English, first names ending with “a” (e.g., “Linda,” “Maria”) are more likely to be associated with female names. Conversely, the suffix “ua” (e.g., “Joshua”) is a typical pattern for male names.
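The suffix–gender correlation can be illustrated with a toy frequency count over the last two characters (the sample names and labels below are made up for the sketch, not our training data):

```python
from collections import Counter, defaultdict

# Tiny illustrative sample; these name/label pairs are made up for the sketch.
names = [("linda", "F"), ("maria", "F"), ("joshua", "M"),
         ("anna", "F"), ("luca", "M"), ("sofia", "F")]

# Count gender frequency for each last-two-character suffix.
suffix_counts = defaultdict(Counter)
for name, gender in names:
    suffix_counts[name[-2:]][gender] += 1

def most_likely_gender(name: str):
    counts = suffix_counts.get(name[-2:])
    if not counts:
        return None  # suffix never seen in the sample
    return counts.most_common(1)[0][0]

print(most_likely_gender("lucia"))  # "ia" occurs only in female names here -> F
```

A character-based neural model learns these kinds of patterns implicitly, over all n-grams rather than just the final two characters.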

Unisex Names

While gender is typically associated with the first name, there are instances where the same first name can be used for one gender in one culture but another gender in a different culture. For example, the name “Andrea” is commonly used as a male name in Italian (derived from the Greek “Andréas”), but it is considered a female name in languages such as English, German, Hungarian, Czech, and Spanish. This poses a complexity as assigning a specific gender to a unisex name may lead to inaccuracies and biased outcomes. It is crucial for our model to handle these names with sensitivity and avoid making assumptions based solely on the name’s popularity within a specific gender group.

Solution: In order to tackle this issue, we believe that taking the last name into account could be helpful in correctly identifying the gender of unisex names. The last name often reflects a person’s ethnicity or cultural background, and by combining it with the first name, we can potentially gain additional insights that contribute to a more accurate determination of gender.

3. The power of AI

Deep learning techniques play a pivotal role in addressing the challenges of gender inference and first/last name separation in the context of our project. By framing gender inference as a binary classification problem and first/last name separation as a word tagging problem, we harness the capabilities of character-based machine learning algorithms to tackle these tasks effectively.

Our pipeline is implemented in a cascade manner, where the first/last name tagging model takes charge of identifying and separating the first and last names in a given string. Once the separation is accomplished, we concatenate the names in the correct order, with the first name preceding the last name. This order is crucial since our gender classification model relies on the assumption that the first name is always positioned at the beginning of the string.


Abstract schema of the AI model for first/last name separation and gender inference
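The cascade can be sketched as follows; the tagger and classifier here are toy stand-ins for the trained models described below, not our actual implementations:

```python
def tag_characters(raw: str) -> list[str]:
    # Naive stand-in tagger: everything before the last space is 'f'
    # (first name), the last space is 'o', the rest is 'l' (last name).
    # The real model is the char-level LSTM tagger described below.
    cut = raw.rfind(" ")
    return ["f" if i < cut else ("o" if i == cut else "l")
            for i in range(len(raw))]

def classify_gender(full_name: str) -> str:
    # Naive stand-in classifier based on the final letter of the first name;
    # the real model is the char-level LSTM classifier described below.
    return "F" if full_name.split(" ")[0].endswith("a") else "M"

def pipeline(raw: str) -> tuple[str, str, str]:
    tags = tag_characters(raw)
    first = "".join(c for c, t in zip(raw, tags) if t == "f")
    last = "".join(c for c, t in zip(raw, tags) if t == "l")
    # Re-concatenate in canonical order: the gender model assumes
    # the first name leads the string.
    return first, last, classify_gender(f"{first} {last}")

print(pipeline("paolo luca bianchi"))  # ('paolo luca', 'bianchi', 'M')
```

The key point is the ordering contract between the two stages: whatever order the raw string had, the classifier always receives “first name, space, last name”.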

Input representation

Each input string, to be interpretable by AI models, needs to be converted into a numeric space. Following the aforementioned character-based encoding, we map each string to its sequence of UTF-8 bytes. Since UTF-8 encodes a code point in one to four bytes, depending on its value, this approach accommodates names with diverse characters. For example, "ciào" is encoded as [99, 105, 195, 160, 111], where the accented "à" takes two bytes (195, 160).


To efficiently implement this encoding methodology, we leverage the ByT5 tokenizer, which streamlines the character-based representation [2].
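The byte-level mapping itself needs nothing beyond the standard library; this sketch reproduces the encoding described above (the ByT5 tokenizer additionally offsets the byte values to reserve IDs for special tokens):

```python
def encode_bytes(name: str) -> list[int]:
    # Plain UTF-8 byte values; the actual ByT5 tokenizer additionally
    # offsets these IDs to reserve slots for special tokens.
    return list(name.encode("utf-8"))

print(encode_bytes("ciào"))    # [99, 105, 195, 160, 111] - 'à' takes two bytes
print(encode_bytes("Müller"))  # [77, 195, 188, 108, 108, 101, 114]
```

Because every Unicode string has a UTF-8 representation, Greek names like ζωή or accented names like João need no special-casing and no out-of-vocabulary handling.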

Modelling

Following dataset cleaning and preprocessing, the data is partitioned into three separate sets stratified by country: 70% is allocated for training, 10% for validation, and the remaining 20% for testing purposes. The data is then preprocessed in the desired format, and two models are trained:

  1. Name Tagging Model:
  • Representation Preprocessing: The name tagging model is treated as a Named Entity Recognition (NER) task at the character level. The input consists of the byte IDs produced by the tokenizer, padded to the length of the longest sequence in the batch (dynamic padding). The output is a sequence of target classes (first name/last name) associated with each character in the input. The space separating the first and last name is assigned a dedicated label (e.g., ‘o’).
    For instance, “paolo luca bianchi” maps to [f, f, f, f, f, f, f, f, f, f, o, l, l, l, l, l, l, l]: the space inside the double first name “paolo luca” keeps the f label, while the separator before the last name gets o.
  • Model Architecture: Given the nature of the sequences involved, the model is LSTM-based — a common approach for sequence-related tasks. Characters are input into the network and processed through several stacked alternating forward and backward LSTM layers. A Conditional Random Field (CRF) applied to the final layer provides a distribution over the tag at each position. The most likely character tag sequence is subsequently produced using a Viterbi decoder to ensure word-level consistency. This network architecture was implemented based upon the work of Kuru et al. [3] represented in the figure below. Training parameters included 15 epochs, 5 bidirectional LSTM layers, a batch size of 2048, and a learning rate of 1e-4 with AdamW optimizer. The total trainable parameters amount to 890K.

NER char-based model: 5-layer Bidirectional LSTM Network with a Viterbi decoder [3].
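To make the labeling scheme concrete, here is a minimal sketch (the function name is ours) that builds the character-level tag sequence from a known (first name, last name) training pair:

```python
def make_tags(first: str, last: str) -> tuple[str, list[str]]:
    # Build the input string and its character-level label sequence:
    # 'f' for first-name characters (including any internal space),
    # 'o' for the separating space, 'l' for last-name characters.
    text = f"{first} {last}"
    tags = ["f"] * len(first) + ["o"] + ["l"] * len(last)
    return text, tags

text, tags = make_tags("paolo luca", "bianchi")
print("".join(tags))  # ffffffffffolllllll
```

At inference time the mapping runs in reverse: the model predicts one tag per character, and the tags are used to split the string back into first and last name.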

2. Gender Classification Model:

  • Representation Preprocessing: The gender classification model is framed as a binary classification task, designed to discern between male and female gender labels. To ensure unisex names are properly handled, the encoded first name is concatenated with the last name as input, using a space as separator. The target outputs are the gender labels “M” and “F”, which are mapped to indices with the following dictionary:

    gender_to_idx = {'F': 0, 'M': 1}
  • Model Architecture: The gender classification model employs a deep learning architecture suited to binary classification. It takes the concatenated first and last names and processes them through several layers. Drawing inspiration from Hu et al.’s LSTM model [4], the architecture uses a 512-dimensional embedding layer to encode the input bytes, followed by two unidirectional LSTM layers, each with 192 hidden units.
    The final output vector of the last LSTM layer is then fed to a stack of two fully-connected layers with 128 and 64 neurons, respectively, each followed by dropout with rate 0.2 to guard against overfitting.
    Finally, the output layer, with a single neuron, uses the sigmoid activation function. The total number of trainable parameters is 1.06M.

Character-based LSTM model for gender classification
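A PyTorch sketch of this classifier, under our reading of the description (the vocabulary size, padding index, ReLU activations, and use of the final timestep are our assumptions, not details from the original):

```python
import torch
import torch.nn as nn

class GenderClassifier(nn.Module):
    """Sketch of the described architecture: byte embedding (512 dims),
    two unidirectional LSTM layers (192 hidden units), two fully-connected
    layers (128 and 64 neurons) with dropout 0.2, and a sigmoid output."""
    def __init__(self, vocab_size: int = 259):  # assumed: 256 bytes + specials
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 512, padding_idx=0)
        self.lstm = nn.LSTM(512, 192, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(192, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(byte_ids)     # (batch, seq, 512)
        out, _ = self.lstm(x)            # (batch, seq, 192)
        return self.head(out[:, -1, :])  # last timestep -> probability of 'M'

model = GenderClassifier()
probs = model(torch.randint(1, 259, (4, 20)))  # batch of 4 byte sequences
print(probs.shape)  # torch.Size([4, 1])
```

Trained with binary cross-entropy against the 0/1 labels from gender_to_idx, the sigmoid output can be thresholded at 0.5 to produce the final “M”/“F” prediction.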

Evaluation

All the experiments and training have been performed on the AWS Cloud, using a g5.12xlarge EC2 instance.
Here is a table displaying the evaluation metrics for the gender classification and name tagging models on the test set. These results demonstrate the robustness and precision of our trained models. They effectively fulfilled their designated tasks, providing high accuracy levels, and striking a robust balance between precision and recall.


Metrics computed on the test set

Conclusions

Harnessing the capabilities of machine learning models, we established a comprehensive pipeline that can accurately infer gender and separate first and last names, even in complex and multilingual data scenarios. The power of AI and Deep Learning enables us to achieve higher accuracy, consistency, and improved customer experiences, contributing to Treatwell’s mission as a leading platform in the beauty and personal care industry.

Our success story does not end here. Given its effectiveness, this system has been seamlessly integrated into a tool used daily by our Marketplace Content team. This deployment provides a GDPR-compliant way to clean up customer databases and contributes to organizational efficiency and accuracy. Over the past months, the AI system has reduced operational time by more than half and significantly reduced the errors inherent in human-led operations.

In conclusion, such a tool is extremely valuable for businesses that rely on customer data for their operations.

References

[1] Samuele Mazzanti “How to teach your computer to recognize names and surnames” (2019). URL: https://towardsdatascience.com/solving-an-unsupervised-problem-with-a-supervised-algorithm-df1e36096aba

[2] Linting Xue et al. “ByT5: Towards a token-free future with pre-trained byte-to-byte models” (2022). URL: https://arxiv.org/abs/2105.13626

[3] Onur Kuru et al. “CharNER: Character-Level Named Entity Recognition” (2016). URL: https://aclanthology.org/C16-1087/

[4] Yifan Hu et al. “What’s in a Name? — Gender Classification of Names with Character Based Machine Learning Models” (2021). URL: https://arxiv.org/abs/2102.03692

