
Genpact Machine Learning Hackathon - 5th Place solution

Dec 25, 2018

Here is my 5th place solution to the Genpact Machine Learning Hackathon conducted by Analytics Vidhya in December 2018.

The full Python code is available on my GitHub repository.

Problem Statement

The task in this ML hackathon was to predict the number of food orders for an online food delivery business at each of its branches for a particular week in the future.

Solving such a problem is useful for planning just-in-time procurement of ingredients so as to reduce wastage and costs.

A look at the data

Here’s the training data we were asked to work with.

id: Unique transaction id
week: Week number; the training data had weeks 1 through 145
center_id: Unique identifier for the branch of the online food delivery business
meal_id: Unique identifier for the meal
checkout_price: Price of the meal after discounts, coupons, etc.
base_price: Base price of the meal
emailer_for_promotion: Boolean indicating whether the meal was promoted via email
homepage_featured: Boolean indicating whether the meal was featured on the website's homepage
num_orders: The target (or dependent) variable we were asked to predict

There was also the following information about each branch of the food delivery business.

center_id: Unique identifier for the branch of the online food delivery business
city_code: Unique identifier for the city in which the branch operates
region_code: Unique identifier for the region in which the branch operates
center_type: Categorical variable for the branch type
op_area: Operating area of the branch

Then, there was some information about the meals themselves.

meal_id: Unique identifier for the meal
category: The meal category
cuisine: The meal cuisine (categorical variable)

Machine Learning Model

I decided to use the LightGBM regressor for this challenge since, in my experience with such competitions, gradient boosted trees are both powerful and popular.

Feature Engineering and Data Transformations

I decided to use most of the given features as is, apart from the following new features that I engineered:

week_sin: Sine transform of 'week' to capture cyclic dependency
week_cos: Cosine transform of 'week' to capture cyclic dependency
price_diff_percent: Percentage difference between checkout_price and base_price

Sine and cosine transforms are frequently used to represent cyclic features like 'week' in our case. This is useful when you are trying to capture dependencies like increased demand during a particular month every year due to a festival, for example.

The formulas for the sine and cosine transforms of the 'week' variable are as follows:

week_sin = np.sin(2 * np.pi * week / 52.143)
week_cos = np.cos(2 * np.pi * week / 52.143)
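
As a quick illustration (my own sanity check, not part of the original solution), weeks that are roughly one year apart end up at nearly the same point on the unit circle, even though their raw 'week' values are far apart:

import numpy as np

# Weeks ~52.143 apart map to almost the same (week_sin, week_cos) pair,
# so the model can treat them as nearby points in the yearly cycle.
for week in (1, 53):
    week_sin = np.sin(2 * np.pi * week / 52.143)
    week_cos = np.cos(2 * np.pi * week / 52.143)
    print(week, round(week_sin, 3), round(week_cos, 3))
# prints approximately:
# 1  0.12  0.993
# 53 0.103 0.995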

Of course, I decided to keep the original 'week' feature as well to capture long-term trends (for example, an increase in demand over the years).

I used scikit-learn's LabelEncoder to encode categorical variables since that is how LightGBM prefers them.

Transforming the target variable

I used a log transform (np.log1p()) on the target variable, num_orders, so that it looked more like a Gaussian distribution (bell-shaped curve). The original num_orders values ranged from a few hundred to several thousand, with the majority of values in the lower range.

Another reason for log transforming the target variable was that the competition metric was RMSLE (root mean squared log error), which means that after the log transformation I could simply use the built-in "mse" or "rmse" metric of LightGBM.
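
To make that connection concrete, here is a small sanity check (my own illustration, not from the original write-up): RMSLE computed on the original scale is exactly RMSE computed on the log1p-transformed values.

import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared log error on the original scale.
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(y_true, y_pred):
    # Plain root mean squared error.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.array([10.0, 200.0, 3000.0])   # hypothetical order counts
y_pred = np.array([12.0, 180.0, 3500.0])   # hypothetical predictions

# Both lines print the same value.
print(rmsle(y_true, y_pred))
print(rmse(np.log1p(y_true), np.log1p(y_pred)))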

Hyperparameter tuning

I used scikit-learn's ParameterGrid to systematically search through hyperparameter values for the LightGBM model.

The hyperparameters I tuned with this method are:

  1. colsample_bytree - Also called feature fraction, it is the fraction of features to consider while building a single gradient boosted tree. Reducing its value reduces overfitting by considering fewer features while building each tree.
  2. min_child_samples - The minimum number of samples required in a leaf node of the tree. Increasing it reduces overfitting.
  3. num_leaves - The maximum number of leaf nodes per tree. The higher the number, the more complex and deeper the tree, which makes the model more likely to overfit.

Choosing the cross-validation set

Since we are trying to predict the number of orders on a future date, it makes sense to order the training data by 'week' in ascending order and then pick samples at the end of the list as our cross-validation set. For example, since we are given training data for weeks 1 through 145, we can use data for weeks 1 through 140 as our training data and weeks 141 through 145 as our cross-validation data.

For this, I used scikit-learn's train_test_split to split the given training data into a train and cross-validation set. Note that I explicitly set shuffle=False since we want the data to remain ordered by week so that the samples towards the end become our cross-validation set.

Solution

The full Python code is available on my GitHub repository.

Read the training and test datasets.

import numpy as np
import pandas as pd

df_train = pd.read_csv('train_GzS76OK/train.csv')
df_center_info = pd.read_csv('train_GzS76OK/fulfilment_center_info.csv')
df_meal_info = pd.read_csv('train_GzS76OK/meal_info.csv')
df_test = pd.read_csv('test_QoiMO9B.csv')

Merge with branch and meal information.

df_train = pd.merge(df_train, df_center_info,
                    how="left",
                    left_on='center_id',
                    right_on='center_id')

df_train = pd.merge(df_train, df_meal_info,
                    how='left',
                    left_on='meal_id',
                    right_on='meal_id')

df_test = pd.merge(df_test, df_center_info,
                   how="left",
                   left_on='center_id',
                   right_on='center_id')

df_test = pd.merge(df_test, df_meal_info,
                   how='left',
                   left_on='meal_id',
                   right_on='meal_id')

Feature engineering - Convert ‘city_code’ and ‘region_code’ into a single feature - ‘city_region’.

df_train['city_region'] = \
        df_train['city_code'].astype('str') + '_' + \
        df_train['region_code'].astype('str')

df_test['city_region'] = \
        df_test['city_code'].astype('str') + '_' + \
        df_test['region_code'].astype('str')

Label encode the categorical features (label encoded features will have the suffix '_encoded').

from sklearn import preprocessing

label_encode_columns = ['center_id', 
                        'meal_id', 
                        'city_code', 
                        'region_code',
                        'city_region',
                        'center_type', 
                        'category', 
                        'cuisine']

le = preprocessing.LabelEncoder()

# The encoder is fit on the training data only, which assumes every category
# present in the test set also appears in the training set.
for col in label_encode_columns:
    le.fit(df_train[col])
    df_train[col + '_encoded'] = le.transform(df_train[col])
    df_test[col + '_encoded'] = le.transform(df_test[col])

Feature engineering - Sine and Cosine transform for ‘week’ - Capture cyclic dependency.

df_train['week_sin'] = np.sin(2 * np.pi * df_train['week'] / 52.143)
df_train['week_cos'] = np.cos(2 * np.pi * df_train['week'] / 52.143)

df_test['week_sin'] = np.sin(2 * np.pi * df_test['week'] / 52.143)
df_test['week_cos'] = np.cos(2 * np.pi * df_test['week'] / 52.143)

Feature engineering - Price difference percentage.

df_train['price_diff_percent'] = \
        (df_train['base_price'] - df_train['checkout_price']) / df_train['base_price']

df_test['price_diff_percent'] = \
        (df_test['base_price'] - df_test['checkout_price']) / df_test['base_price']

Feature engineering - Convert the ad campaign features - ‘emailer_for_promotion’ and ‘homepage_featured’ into a single feature.

Both of these features were originally boolean (0 and 1), so adding them up to create a new feature (with values 0, 1, or 2) does not require label encoding.

df_train['email_plus_homepage'] = df_train['emailer_for_promotion'] + df_train['homepage_featured']

df_test['email_plus_homepage'] = df_test['emailer_for_promotion'] + df_test['homepage_featured']

Prepare a list of features to train on. Split them into categorical and numerical features.

columns_to_train = ['week',
                    'week_sin',
                    'week_cos',
                    'checkout_price',
                    'base_price',
                    'price_diff_percent',
                    'email_plus_homepage',
                    'city_region_encoded',
                    'center_type_encoded',
                    'op_area',
                    'category_encoded',
                    'cuisine_encoded',
                    'center_id_encoded',
                    'meal_id_encoded']

categorical_columns = ['email_plus_homepage',
                       'city_region_encoded',
                       'center_type_encoded',
                       'category_encoded',
                       'cuisine_encoded',
                       'center_id_encoded',
                       'meal_id_encoded']

numerical_columns = [col for col in columns_to_train if col not in categorical_columns]

Log transform the target variable - num_orders.

df_train['num_orders_log1p'] = np.log1p(df_train['num_orders'])

I used np.log1p() instead of np.log() because it is more numerically stable (e.g., log(0) is not defined).
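
A one-line illustration of the difference (my own example, not from the original post):

import numpy as np

print(np.log(0))    # -inf (with a "divide by zero" runtime warning)
print(np.log1p(0))  # 0.0, i.e. log(1 + 0) is well defined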

Train + Cross-validation split.

The original dataset was already sorted by week number, so I just had to pick the samples towards the end as the cross-validation set. This corresponds to week numbers 141 through 145. Since we're trying to predict orders at a future date, random shuffling of the dataset before the split does not make sense, hence shuffle=False.

from sklearn.model_selection import train_test_split

X = df_train[categorical_columns + numerical_columns]
y = df_train['num_orders_log1p']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02, shuffle=False)

Hyperparameter grid search.

import pprint

from sklearn.model_selection import ParameterGrid
from lightgbm import LGBMRegressor

scores = []
params = []

param_grid = {'num_leaves': [31, 127, 255],
              'min_child_samples': [5, 10, 30],
              'colsample_bytree': [0.4, 0.6, 0.8]}

for i, g in enumerate(ParameterGrid(param_grid)):
    print("param grid {}/{}".format(i, len(ParameterGrid(param_grid)) - 1))
    pprint.pprint(g)
    
    estimator = LGBMRegressor(learning_rate=0.003,
                              n_estimators=10000,
                              silent=False,
                              **g)
    
    fit_params = {'feature_name': categorical_columns + numerical_columns,
                  'categorical_feature': categorical_columns,
                  'eval_set': [(X_train, y_train), (X_test, y_test)]}

    estimator.fit(X_train, y_train, **fit_params)
    
    scores.append(estimator.best_score_['valid_1']['l2'])
    params.append(g)


print("Best score = {}".format(np.min(scores)))
print("Best params =")
print(params[np.argmin(scores)])

LightGBM can natively handle categorical features when they are specified via the categorical_feature parameter of the fit method. Also, I stayed with the default evaluation metric of the LightGBM regressor, which is L2 (i.e., MSE or mean squared error).

Training the final LightGBM regression model on the entire dataset.

I used early stopping to reduce overfitting. As a result, I could not use the entire dataset for training; I had to keep aside a validation set for the purpose of early stopping.

The following model was trained using the best hyperparameters obtained by the parameter grid search step above.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02, shuffle=False)

g = {'colsample_bytree': 0.4,
     'min_child_samples': 5,
     'num_leaves': 255}

estimator = LGBMRegressor(learning_rate=0.003,
                          n_estimators=40000,
                          silent=False,
                          **g)

fit_params = {'early_stopping_rounds': 1000,
              'feature_name': categorical_columns + numerical_columns,
              'categorical_feature': categorical_columns,
              'eval_set': [(X_train, y_train), (X_test, y_test)]}

estimator.fit(X_train, y_train, **fit_params)

Get predictions on the test data and prepare a submission file for the contest.

Since the target variable was log transformed using np.log1p(), the predicted num_orders will have to be inverse transformed using np.expm1().

X = df_test[categorical_columns + numerical_columns]

pred = estimator.predict(X)
pred = np.expm1(pred)

submission_df = df_test.copy()
submission_df['num_orders'] = pred
submission_df = submission_df[['id', 'num_orders']]
submission_df.to_csv('submission.csv', index=False)
