
Gender Prediction Using Mobile App Data

source link: https://hackernoon.com/gender-prediction-using-mobile-app-data-06q35qa


Taras Baranyuk (@sagol)

17+ years of experience in creating software products in various positions.

I have put together a fascinating dataset — a list of users, their installed applications, each user’s gender, and statistics on the gender distribution for apps.

DOI: 10.34740/KAGGLE/DSV/2309388

For a successful advertising campaign, working with segments is vital, and knowing the user’s gender simplifies segment selection considerably.


I will tell you how collecting statistics on applications allows ML to predict a user’s gender.


Two new files have been added to the dataset:


users.csv — A list of users with their most likely gender and a list of some of their installed applications.


bundles_gender.csv — The gender distribution of users for each application.


Pay attention to the cnt field — it shows the number of users with a known gender who have the app installed, which is what lets us collect statistics about the app. This field can also be used as a measure of confidence in the information about the application.
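The article doesn’t show the loading step; here is a minimal sketch, assuming pandas is used, that the first column of bundles_gender.csv holds the app id, and that the ids column of users.csv is stored as a stringified list:

import ast

import pandas as pd

# Load the two dataset files (file names from the dataset description)
users_df = pd.read_csv('users.csv')
# Assumption: the first column holds the app id
genders_df = pd.read_csv('bundles_gender.csv', index_col=0)

# Assumption: 'ids' is stored as a stringified list of app ids
users_df['ids'] = users_df['ids'].apply(ast.literal_eval)

# 'cnt' shows how many known-gender users have each app installed
genders_df['cnt'].describe()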

First of all, it is interesting to look at how gender is distributed among devices.
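A quick way to see the split (a sketch; the gend column name is taken from the snippets later in the article):

# Share of devices per gender value
users_df['gend'].value_counts(normalize=True)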


One might expect the devices to be split roughly equally, but that is not the case. Therefore, I hypothesize that women are less likely to indicate their gender in an app’s settings.


Perhaps this is influenced by the fact that fewer applications are made exclusively for a female audience than for a male one. The following picture indirectly confirms this.


Let’s look at the histogram of the per-app female share. The first one is built without any additional filters.
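A minimal plotting sketch (matplotlib assumed; the F column holds the share of female users per app):

import matplotlib.pyplot as plt

# Histogram of the female share across all apps, no filtering yet
genders_df['F'].hist(bins=200)
plt.xlabel('share of female users')
plt.ylabel('number of apps')
plt.show()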


Almost nothing is visible except for symmetrical and pronounced peaks. Let’s take a closer look at one of the peaks.

genders_df[
    (genders_df['F']>=0.3325) & 
    (genders_df['F']<=0.3375)
].describe()

If you ignore the outlier, you can see that most of the applications in this subsample occur extremely rarely, which produces a large number of identical values.


Let’s try to keep only those applications that are encountered more than 10 times.
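A sketch of the same histogram with the threshold applied to the cnt field introduced above:

import matplotlib.pyplot as plt

# Keep only apps installed by more than 10 known-gender users
genders_df[genders_df['cnt'] > 10]['F'].hist(bins=200)
plt.xlabel('share of female users')
plt.ylabel('number of apps')
plt.show()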


Peaks are still visible but not so clear. Increasing the threshold to 50 almost eliminates the peaks.


The graph clearly shows that there are fewer applications with a female audience.


New Features

Let’s create a few additional features that bring in extra information.


It can be assumed that the number of installed applications can be helpful.

# Number of installed apps per user
users_df['apps_count'] = users_df['ids'].apply(len)
users_df.groupby('gend')['apps_count'].describe()

You can see that women, on average, install more apps on their devices.


I have data about users, their gender and installed apps, and information about the distribution of gender for these applications. Is there a correlation between this data? It is logical to assume that there is, but how strong is this correlation?

# Map app id -> share of female users for that app
g_dict = genders_df['F'].to_dict()

# For each user, average the female share over the installed apps,
# skipping apps that are missing from the statistics
users_df['F_prob'] = users_df['ids'].apply(
    lambda x: np.mean(
        list(filter(None.__ne__, map(g_dict.get, x)))
    )
)

Instead of the average, you can use more complex methods, but for the initial analysis, this is quite enough.

# Correlation between the averaged app statistic and the encoded gender
np.corrcoef(
    users_df['F_prob'],
    users_df['gend'].astype('category').cat.codes
)[0, 1]

The correlation turned out to be quite strong. It is negative because the categorical encoding assigns code 0 to F, so a higher F_prob corresponds to a lower gender code.

-0.46602945129982887

The histogram shows that users are well divided into two groups.
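A sketch of that histogram, splitting F_prob by gender (matplotlib assumed):

import matplotlib.pyplot as plt

# Overlayed F_prob distributions for the two gender groups
for g, grp in users_df.groupby('gend'):
    plt.hist(grp['F_prob'].dropna(), bins=50, alpha=0.5, label=g)
plt.xlabel('F_prob')
plt.ylabel('users')
plt.legend()
plt.show()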


Baseline

To draw conclusions and make comparisons, I need a baseline model, so I choose the simplest possible approach.

print(f"Accuracy: \
    {accuracy_score(users_df['gend'].astype('category').cat.codes, users_df['F_prob']<0.5)}")
print(f"AUC: \
    {1 - roc_auc_score(users_df['gend'].astype('category').cat.codes, users_df['F_prob'])}")

Even such a naive approach gives a good result, but let’s try to improve it further.

Accuracy:     0.740925288445762
AUC     :     0.7793767183917958

Train and Test

Since the dataset with users is large, I can select a subset on which the models will be checked and compared.

train, test = train_test_split(
    users_df, train_size=0.7,
    random_state=0, stratify=users_df['gend'])

Logistic Regression

First, I’ll try the simplest and most common method — logistic regression. But for this, I need numeric features instead of lists of ids. Again, I can use the simplest method — binarization.


But there is an obvious problem — the sheer number of unique ids.
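A quick sketch of how to check that number:

from itertools import chain

# Count the distinct app ids across all users
len(set(chain.from_iterable(users_df['ids'])))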


But the resulting binarized data will be sparse, which allows the use of sparse matrices.

# One binary column per app id; sparse output keeps memory usage manageable
mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(users_df['ids'])
train_mlb = mlb.transform(train['ids'])
test_mlb = mlb.transform(test['ids'])

I use the OOF (Out-of-Fold) approach to obtain reliable results and reduce the influence of randomness when dividing into training and validation subsamples. Instead of using third-party libraries, I wrote a simple function. Please note that splitting the dataset into folds must be stratified.

def get_oof_lr(n_folds, x_train, y, x_test, seeds):
    
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]  
        
    oof_train = np.zeros((len(seeds), ntrain, 2))
    oof_test = np.zeros((ntest, 2))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 2))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)          
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train[tr_i, :]
            y_tr = y[tr_i]
            x_te = x_train[t_i, :]
            y_te = y[t_i]
            model = LogisticRegression(
                random_state=seed,
                max_iter = 10000,
                verbose=1,
                n_jobs=20
            )
            model.fit(x_tr, y_tr)
            oof_train[iseed, t_i, :] = \
                model.predict_proba(x_te)
            print(f"AUC: {roc_auc_score(y_te, oof_train[iseed, t_i, :][:,1])}")
            oof_test_skf[iseed, i, :, :] = \
                model.predict_proba(x_test)
            models[(seed, i)] = model
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models
Seed 0, Fold 0: 0.8752592302937795
Seed 0, Fold 1: 0.8741272807694727
Seed 0, Fold 2: 0.8754404425783484
Seed 0, Fold 3: 0.8750862228494931
Seed 0, Fold 4: 0.8767777821454008
Seed 42, Fold 0: 0.876839970445301
Seed 42, Fold 1: 0.8771914077769174
Seed 42, Fold 2: 0.8762049208242458
Seed 42, Fold 3: 0.8725705419477277
Seed 42, Fold 4: 0.8731672122759209
Seed 888, Fold 0: 0.8752996641300741
Seed 888, Fold 1: 0.8749304780764804
Seed 888, Fold 2: 0.8762614986655877
Seed 888, Fold 3: 0.8765240184267109
Seed 888, Fold 4: 0.8725618258459555

Let’s check the prediction on the test subsample.
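The call and the test-set evaluation are not shown in the article; a sketch under assumed variable names might look like this:

# Run the OOF loop (seeds taken from the log above)
oof_train_lr, oof_test_lr, models_lr = get_oof_lr(
    n_folds=5,
    x_train=train_mlb,
    y=train['gend'].astype('category').cat.codes.values,
    x_test=test_mlb,
    seeds=[0, 42, 888]
)

# Evaluate the averaged prediction on the test subsample
y_test = test['gend'].astype('category').cat.codes
print(f"Accuracy: {accuracy_score(y_test, oof_test_lr[:, 1] > 0.5)}")
print(f"AUC: {roc_auc_score(y_test, oof_test_lr[:, 1])}")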

Accuracy:     0.8208932240918818
AUC     :     0.8798990678456793

I would say the difference compared to the baseline is large. I assume the quality can be increased further by tuning the hyperparameters; let that be the reader’s homework.


CatBoost #1

When I look at the ids feature, I see a list of tokens. Why not try working with this data like plain text?


I chose CatBoost as the free library for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. From release 0.19.1, it supports text features for classification on GPU out-of-the-box. The main advantage is that CatBoost can include categorical features and text features in your data without additional preprocessing. You can find more detail about text features in the article Unconventional Sentiment Analysis: BERT vs. Catboost.

!pip install catboost

Let’s write a function to initialize and train the model.

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=1000,
        learning_rate=0.1,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True
    )

Unfortunately, in the current version of CatBoost, it is impossible to use a list of already prepared tokens. Therefore, let’s do a little trick — turn the feature into text and use it to create a model.

users_df['ids_txt'] = \
    users_df['ids'].apply(
        lambda x: " ".join(str(i) for i in x))

As with logistic regression, I make an OOF prediction.

columns = ['ids_txt', 'apps_count']
oof_train_cb, oof_test_cb, models_cb = get_oof_cb(
    n_folds=5,
    x_train=train[columns],
    y=train['gend'].values,
    x_test=test[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)
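The get_oof_cb function itself is not shown in the article; a minimal sketch, adapting get_oof_lr to CatBoost with the fit_model helper above (the details here are assumptions, not the author’s exact code), might look like this:

from catboost import Pool

def get_oof_cb(n_folds, x_train, y, x_test, text_features, seeds):
    oof_train = np.zeros((len(seeds), x_train.shape[0], 2))
    oof_test_skf = np.empty((len(seeds), n_folds, x_test.shape[0], 2))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            # Pools carry the text-feature declaration for CatBoost
            train_pool = Pool(x_train.iloc[tr_i], y[tr_i], text_features=text_features)
            valid_pool = Pool(x_train.iloc[t_i], y[t_i], text_features=text_features)
            model = fit_model(train_pool, valid_pool, random_seed=seed)
            oof_train[iseed, t_i, :] = model.predict_proba(valid_pool)
            oof_test_skf[iseed, i, :, :] = model.predict_proba(
                Pool(x_test, text_features=text_features))
            models[(seed, i)] = model
    # Average test predictions over folds and seeds, train predictions over seeds
    oof_test = oof_test_skf.mean(axis=1).mean(axis=0)
    return oof_train.mean(axis=0), oof_test, models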

The quality metrics on the test subsample are better than with logistic regression.

Accuracy:     0.8218224490121011
AUC     :     0.8856101448105046

Interestingly, two completely different approaches give very similar results. In such a situation, it is logical to assume that combining methods will give a synergistic effect.


CatBoost #2

As a new feature, I’ve added the OOF predictions from the logistic regression model. In addition, I don’t forget about the F_prob feature, which worked well for the baseline model.
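The construction of train_2 and test_2 is not shown; a sketch under assumed names (oof_train_lr and oof_test_lr from the logistic regression step) might be:

# Attach the logistic-regression OOF probability as the 'lr' feature
train_2 = train.copy()
test_2 = test.copy()
train_2['lr'] = oof_train_lr[:, 1]
test_2['lr'] = oof_test_lr[:, 1]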

columns = ['ids_txt', 'F_prob', 'lr', 'apps_count']
oof_train_cb_2, oof_test_cb_2, models_cb_2 = get_oof(
    n_folds=5,
    x_train=train_2[columns],
    y=train_2['gend'].values,
    x_test=test_2[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)

I can say that the model predicts the user’s gender very well using only information about the applications installed on the device.

Accuracy:     0.836950230713273
AUC     :     0.9010077023800467

Summary

In this story, I:

  • Introduced a new free dataset;
  • Did exploratory data analysis;
  • Created several new features;
  • Created several models for predicting the gender of a user of a mobile device.

All this required accumulating statistics on the applications that users install, together with information about the gender distribution of users for the applications themselves.


The code from the article can be viewed here.
