
Gender Prediction Using Mobile App Data

source link: https://hackernoon.com/gender-prediction-using-mobile-app-data-06q35qa


Taras Baranyuk (@sagol)

17+ years of experience in creating software products in various positions.

I have put together a fascinating dataset — a list of users, their installed applications, each user’s gender, and statistics on the gender distribution for apps.

DOI: 10.34740/KAGGLE/DSV/2309388

For a successful advertising campaign, working with segments is vital, and knowing the user’s gender simplifies segment selection considerably.


I will tell you how collecting statistics on applications allows ML to predict a user’s gender.


Two new files have been added to the dataset:


users.csv — A list of users with their most likely gender and a list of some of their installed applications.


bundles_gender.csv — The gender distribution of users for each application.


Pay attention to the cnt field — it shows the number of users with a known gender who have the app installed, which is what lets us collect statistics about the app. This field can also be used as a measure of confidence in the information about the application.
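The article doesn’t show the loading step; here is a minimal sketch, assuming pandas is used, that the first column of bundles_gender.csv holds the app id, and that the ids column of users.csv is stored as a stringified list:

import ast

import pandas as pd

# Load the two dataset files (file names from the dataset description)
users_df = pd.read_csv('users.csv')
# Assumption: the first column holds the app id
genders_df = pd.read_csv('bundles_gender.csv', index_col=0)

# Assumption: 'ids' is stored as a stringified list of app ids
users_df['ids'] = users_df['ids'].apply(ast.literal_eval)

# 'cnt' shows how many known-gender users have each app installed
genders_df['cnt'].describe()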

First of all, it is interesting to look at how gender is distributed among devices.
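A quick way to see the split (a sketch; the gend column name is taken from the snippets later in the article):

# Share of devices per gender value
users_df['gend'].value_counts(normalize=True)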


One might expect the devices to be split roughly equally, but that is not the case. Therefore, I hypothesize that women are less likely to indicate their gender in an app’s settings.


Perhaps this is influenced by the fact that fewer applications are made exclusively for a female audience than for a male one. The following picture indirectly confirms this.


Let’s look at the histogram of the per-app female share. The first one is built without any additional filters.
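A minimal plotting sketch (matplotlib assumed; the F column holds the share of female users per app):

import matplotlib.pyplot as plt

# Histogram of the female share across all apps, no filtering yet
genders_df['F'].hist(bins=200)
plt.xlabel('share of female users')
plt.ylabel('number of apps')
plt.show()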


Almost nothing is visible except for symmetrical and pronounced peaks. Let’s take a closer look at one of the peaks.

genders_df[
    (genders_df['F']>=0.3325) & 
    (genders_df['F']<=0.3375)
].describe()

If you ignore the outlier, you can see that most of the applications in this subsample occur extremely rarely, which produces a large number of identical values.


Let’s try to keep only those applications that are encountered more than 10 times.
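A sketch of the same histogram with the threshold applied to the cnt field introduced above:

import matplotlib.pyplot as plt

# Keep only apps installed by more than 10 known-gender users
genders_df[genders_df['cnt'] > 10]['F'].hist(bins=200)
plt.xlabel('share of female users')
plt.ylabel('number of apps')
plt.show()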


Peaks are still visible but not so clear. Increasing the threshold to 50 almost eliminates the peaks.


The graph clearly shows that there are fewer applications with a female audience.


New Features

Let’s create a few additional features that bring in extra information.


It can be assumed that the number of installed applications can be helpful.

# Number of installed apps per user
users_df['apps_count'] = users_df['ids'].apply(len)
users_df.groupby('gend')['apps_count'].describe()

You can see that women, on average, install more apps on their devices.


I have data about users, their gender and installed apps, and information about the distribution of gender for these applications. Is there a correlation between this data? It is logical to assume that there is, but how strong is this correlation?

# Map app id -> share of female users for that app
g_dict = genders_df['F'].to_dict()

# For each user, average the female share over the installed apps,
# skipping apps that are missing from the statistics
users_df['F_prob'] = users_df['ids'].apply(
    lambda x: np.mean(
        list(filter(None.__ne__, map(g_dict.get, x)))
    )
)

Instead of the average, you can use more complex methods, but for the initial analysis, this is quite enough.

# Correlation between the averaged app statistic and the encoded gender
np.corrcoef(
    users_df['F_prob'],
    users_df['gend'].astype('category').cat.codes
)[0, 1]

The correlation turned out to be quite strong. It is negative because the categorical encoding assigns code 0 to F, so a higher F_prob corresponds to a lower gender code.

-0.46602945129982887

The histogram shows that users are well divided into two groups.
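A sketch of that histogram, splitting F_prob by gender (matplotlib assumed):

import matplotlib.pyplot as plt

# Overlayed F_prob distributions for the two gender groups
for g, grp in users_df.groupby('gend'):
    plt.hist(grp['F_prob'].dropna(), bins=50, alpha=0.5, label=g)
plt.xlabel('F_prob')
plt.ylabel('users')
plt.legend()
plt.show()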


Baseline

To draw conclusions and make comparisons, I need a baseline model, so I choose the simplest possible approach.

print(f"Accuracy: \
    {accuracy_score(users_df['gend'].astype('category').cat.codes, users_df['F_prob']<0.5)}")
print(f"AUC: \
    {1 - roc_auc_score(users_df['gend'].astype('category').cat.codes, users_df['F_prob'])}")

Even such a naive approach gives a good result, but let’s try to improve it further.

Accuracy:     0.740925288445762
AUC     :     0.7793767183917958

Train and Test

Since the dataset with users is large, I can select a subset on which the models will be checked and compared.

train, test = train_test_split(
    users_df, train_size=0.7,
    random_state=0, stratify=users_df['gend'])

Logistic Regression

First, I’ll try the simplest and most common method — logistic regression. But for this, I need numeric features instead of lists of ids. Again, I can use the simplest method — binarization.


But there is an obvious problem — the sheer number of unique ids.
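A quick sketch of how to check that number:

from itertools import chain

# Count the distinct app ids across all users
len(set(chain.from_iterable(users_df['ids'])))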


But the resulting binarized data will be sparse, which allows the use of sparse matrices.

# One binary column per app id; sparse output keeps memory usage manageable
mlb = MultiLabelBinarizer(sparse_output=True)
mlb.fit(users_df['ids'])
train_mlb = mlb.transform(train['ids'])
test_mlb = mlb.transform(test['ids'])

I use the OOF (Out-of-Fold) approach to obtain reliable results and reduce the influence of randomness when dividing into training and validation subsamples. Instead of using third-party libraries, I wrote a simple function. Please note that splitting the dataset into folds must be stratified.

def get_oof_lr(n_folds, x_train, y, x_test, seeds):
    
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]  
        
    oof_train = np.zeros((len(seeds), ntrain, 2))
    oof_test = np.zeros((ntest, 2))
    oof_test_skf = np.empty((len(seeds), n_folds, ntest, 2))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(
            n_splits=n_folds,
            shuffle=True,
            random_state=seed)          
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            print(f'\nSeed {seed}, Fold {i}')
            x_tr = x_train[tr_i, :]
            y_tr = y[tr_i]
            x_te = x_train[t_i, :]
            y_te = y[t_i]
            model = LogisticRegression(
                random_state=seed,
                max_iter = 10000,
                verbose=1,
                n_jobs=20
            )
            model.fit(x_tr, y_tr)
            oof_train[iseed, t_i, :] = \
                model.predict_proba(x_te)
            print(f"AUC: {roc_auc_score(y_te, oof_train[iseed, t_i, :][:,1])}")
            oof_test_skf[iseed, i, :, :] = \
                model.predict_proba(x_test)
            models[(seed, i)] = model
    oof_test[:, :] = oof_test_skf.mean(axis=1).mean(axis=0)
    oof_train = oof_train.mean(axis=0)
    return oof_train, oof_test, models
Seed 0, Fold 0: 0.8752592302937795
Seed 0, Fold 1: 0.8741272807694727
Seed 0, Fold 2: 0.8754404425783484
Seed 0, Fold 3: 0.8750862228494931
Seed 0, Fold 4: 0.8767777821454008
Seed 42, Fold 0: 0.876839970445301
Seed 42, Fold 1: 0.8771914077769174
Seed 42, Fold 2: 0.8762049208242458
Seed 42, Fold 3: 0.8725705419477277
Seed 42, Fold 4: 0.8731672122759209
Seed 888, Fold 0: 0.8752996641300741
Seed 888, Fold 1: 0.8749304780764804
Seed 888, Fold 2: 0.8762614986655877
Seed 888, Fold 3: 0.8765240184267109
Seed 888, Fold 4: 0.8725618258459555

Let’s check the prediction on the test subsample.
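The call and the test-set evaluation are not shown in the article; a sketch under assumed variable names might look like this:

# Run the OOF loop (seeds taken from the log above)
oof_train_lr, oof_test_lr, models_lr = get_oof_lr(
    n_folds=5,
    x_train=train_mlb,
    y=train['gend'].astype('category').cat.codes.values,
    x_test=test_mlb,
    seeds=[0, 42, 888]
)

# Evaluate the averaged prediction on the test subsample
y_test = test['gend'].astype('category').cat.codes
print(f"Accuracy: {accuracy_score(y_test, oof_test_lr[:, 1] > 0.5)}")
print(f"AUC: {roc_auc_score(y_test, oof_test_lr[:, 1])}")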

Accuracy:     0.8208932240918818
AUC     :     0.8798990678456793

I would say the difference compared to the baseline is large. I assume the quality can be increased further by tuning the hyperparameters; let that be the reader’s homework.


CatBoost #1

When I look at the ids feature, I see a list of tokens. Why not try working with this data like plain text?


I chose CatBoost as the free library for the model. CatBoost is a high-performance, open-source library for gradient boosting on decision trees. From release 0.19.1, it supports text features for classification on GPU out-of-the-box. The main advantage is that CatBoost can include categorical features and text features in your data without additional preprocessing. You can find more detail about text features in the article Unconventional Sentiment Analysis: BERT vs. Catboost.

!pip install catboost

Let’s write a function to initialize and train the model.

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=10000,
        eval_metric='AUC',
        od_type='Iter',
        od_wait=1000,
        learning_rate=0.1,
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=1000,
        plot=False,
        use_best_model=True
    )

Unfortunately, in the current version of CatBoost, it is impossible to use a list of already prepared tokens. Therefore, let’s do a little trick — turn the feature into text and use it to create a model.

users_df['ids_txt'] = \
    users_df['ids'].apply(
        lambda x: " ".join(str(i) for i in x))

As with logistic regression, I make an OOF prediction.

columns = ['ids_txt', 'apps_count']
oof_train_cb, oof_test_cb, models_cb = get_oof_cb(
    n_folds=5,
    x_train=train[columns],
    y=train['gend'].values,
    x_test=test[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)
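The get_oof_cb function itself is not shown in the article; a minimal sketch, adapting get_oof_lr to CatBoost with the fit_model helper above (the details here are assumptions, not the author’s exact code), might look like this:

from catboost import Pool

def get_oof_cb(n_folds, x_train, y, x_test, text_features, seeds):
    oof_train = np.zeros((len(seeds), x_train.shape[0], 2))
    oof_test_skf = np.empty((len(seeds), n_folds, x_test.shape[0], 2))
    models = {}
    for iseed, seed in enumerate(seeds):
        kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for i, (tr_i, t_i) in enumerate(kf.split(x_train, y)):
            # Pools carry the text-feature declaration for CatBoost
            train_pool = Pool(x_train.iloc[tr_i], y[tr_i], text_features=text_features)
            valid_pool = Pool(x_train.iloc[t_i], y[t_i], text_features=text_features)
            model = fit_model(train_pool, valid_pool, random_seed=seed)
            oof_train[iseed, t_i, :] = model.predict_proba(valid_pool)
            oof_test_skf[iseed, i, :, :] = model.predict_proba(
                Pool(x_test, text_features=text_features))
            models[(seed, i)] = model
    # Average test predictions over folds and seeds, train predictions over seeds
    oof_test = oof_test_skf.mean(axis=1).mean(axis=0)
    return oof_train.mean(axis=0), oof_test, models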

The quality metrics on the test subsample are better than with logistic regression.

Accuracy:     0.8218224490121011
AUC     :     0.8856101448105046

Interestingly, two completely different approaches give very similar results. In such a situation, it is logical to assume that combining methods will give a synergistic effect.


CatBoost #2

As a new feature, I’ve added the OOF predictions from the logistic regression model. In addition, I don’t forget about the F_prob feature, which worked well for the baseline model.
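The construction of train_2 and test_2 is not shown; a sketch under assumed names (oof_train_lr and oof_test_lr from the logistic regression step) might be:

# Attach the logistic-regression OOF probability as the 'lr' feature
train_2 = train.copy()
test_2 = test.copy()
train_2['lr'] = oof_train_lr[:, 1]
test_2['lr'] = oof_test_lr[:, 1]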

columns = ['ids_txt', 'F_prob', 'lr', 'apps_count']
oof_train_cb_2, oof_test_cb_2, models_cb_2 = get_oof(
    n_folds=5,
    x_train=train_2[columns],
    y=train_2['gend'].values,
    x_test=test_2[columns],
    text_features=['ids_txt'],
    seeds=[0, 42, 888]
)

I can say that the model predicts the user’s gender very well using only information about the applications installed on the device.

Accuracy:     0.836950230713273
AUC     :     0.9010077023800467

Summary

In this story, I:

  • Introduced a new free dataset;
  • Did exploratory data analysis;
  • Created several new features;
  • Created several models for predicting the gender of a user of a mobile device.

All this required accumulating statistics on the applications that users install, together with information about the gender distribution of users for the applications themselves.


The code from the article can be viewed here.
