
How to compare an APL model to a non-APL model ─ Part 2


After completing the first part of this blog series, you should have a hold-out dataset in HANA dedicated to testing, for both the Census case and the California Housing case. In this second part we build an APL model and a non-APL model on the same training data. Predictions are made against the hold-out dataset to ensure a fair comparison between the two models, and we use standard metrics to measure the accuracy of our classification and regression models.

Regression Use Case

We define the HANA dataframes for training and for testing, using the tables prepared in part 1:

from hana_ml import dataframe as hd
conn = hd.ConnectionContext(userkey='MLMDA_KEY')
# Target and key columns of the Housing dataset
target_col = 'Price'
key_col = 'Id'
# Sorted data for training
sql_cmd = 'SELECT * FROM "APL_SAMPLES"."HOUSING_TRAIN" ORDER BY 1'
train_remote = hd.DataFrame(conn, sql_cmd)
# Data for test without the known target
test_remote = conn.table('HOUSING_TEST', schema='APL_SAMPLES').drop([target_col])
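
Before fitting anything, you can optionally count the rows and peek at a few records to confirm the dataframes are wired correctly; a quick check using the hana_ml dataframe API:

# Optional sanity check: row count and a few sample rows
print(train_remote.count(), 'training rows')
train_remote.head(5).collect()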

We train an APL regression model on the train table, keeping the default parameters:

# Learning with APL
from hana_ml.algorithms.apl.gradient_boosting_regression import GradientBoostingRegressor
apl_model = GradientBoostingRegressor()
apl_model.fit(train_remote, label=target_col, key=key_col)
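
If you are curious which variables drive the trained model, the hana_ml APL estimators expose their contributions through get_feature_importances(); a minimal sketch, assuming that method is available in your hana_ml version:

# Optional: inspect the variable contributions of the trained APL model
# (get_feature_importances is provided by hana_ml's APL estimators)
importances = apl_model.get_feature_importances()
print(importances)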

We make predictions using the hold-out dataset as recommended in the first part:

apl_pred_remote = apl_model.predict(test_remote).rename_columns(['ID', 'PREDICTED'])
apl_pred_remote.save('HOUSING_APL_PREDICTIONS', force=True)

We estimate the accuracy of the predictions with the MAE and RMSE indicators:

sql_cmd =  """
select 
round(avg(abs(P."PREDICTED" - A."Price")),4) as "Mean Absolute Error",
round(sqrt(avg(power(P."PREDICTED" - A."Price",2))),4) as "Root Mean Square Error"
from "USER_APL"."HOUSING_APL_PREDICTIONS" P, "APL_SAMPLES"."HOUSING_TEST" A
where P."ID" = A."Id"
"""
apl_err_remote = hd.DataFrame(conn, sql_cmd)
apl_err_remote.collect()

[Figure: MAE and RMSE of the APL model (REG_APL_METRICS-1.png)]
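
For reference, writing the predictions as \(p_i\) and the actual prices as \(a_i\) over the \(n\) hold-out rows, the query above computes:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert p_i - a_i\rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - a_i\right)^2}
\]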

You may want to define a naïve regressor that will serve as a baseline; in the following example the mean price is our predicted value:

sql_cmd =  """
With 
NAIVE as (
select avg("Price") as PREDICTED from "APL_SAMPLES"."HOUSING_TEST"
) 
select 
round(avg(abs(P."PREDICTED" - A."Price")),4) as "Mean Absolute Error",
round(sqrt(avg(power(P."PREDICTED" - A."Price",2))),4) as "Root Mean Square Error"
from NAIVE P, "APL_SAMPLES"."HOUSING_TEST" A
"""
naive_err_remote = hd.DataFrame(conn, sql_cmd)
naive_err_remote.collect()

As expected, the naïve regressor makes much larger errors than the APL regressor:

[Figure: MAE and RMSE of the naïve regressor (REG_NAIVE_METRICS-1.png)]

Now, let’s try the random forest technique from scikit-learn; again, we keep the default parameters:

# Bring the training data into a pandas dataframe for scikit-learn
df_train = conn.table('HOUSING_TRAIN', schema='APL_SAMPLES').collect()
x_train = df_train.drop(columns=[target_col, key_col])
y_train = df_train[target_col]
# Learning with random forest
from sklearn.ensemble import RandomForestRegressor
rdf_model = RandomForestRegressor()
rdf_model.fit(x_train, y_train)
# Make predictions on the hold-out data
df_test = conn.table('HOUSING_TEST', schema='APL_SAMPLES').collect()
x_test = df_test.drop(columns=[target_col, key_col])
y_test = df_test[target_col]
rdf_pred_df = rdf_model.predict(x_test)

Here are the MAE and the RMSE values:

import numpy as np
# MAE
rdf_abs_err_df = abs(rdf_pred_df - y_test)
rdf_mae = np.mean(rdf_abs_err_df)
print('MAE : %.3f' % rdf_mae)
# RMSE
rdf_square_err_df = (rdf_pred_df - y_test)**2
rdf_rmse = np.sqrt(np.mean(rdf_square_err_df))
print('RMSE: %.3f' % rdf_rmse)

[Figure: MAE and RMSE of the random forest model (REG_SKL_METRICS.png)]
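
Equivalently, scikit-learn ships ready-made functions for both indicators; a minimal alternative to the manual NumPy computation:

# Same metrics via scikit-learn's built-in functions
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MAE : %.3f' % mean_absolute_error(y_test, rdf_pred_df))
print('RMSE: %.3f' % np.sqrt(mean_squared_error(y_test, rdf_pred_df)))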

Our two trained models do much better than the mean-price naïve approach. The APL gradient boosting model makes fewer errors than the random forest model; this is no surprise, since gradient boosted trees are usually more accurate than random forests.
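
To make this conclusion easy to verify, you can gather the three error reports into one small pandas table; a sketch where the layout and column labels are mine:

import pandas as pd
# Side-by-side comparison of the regression errors
# (APL and naive values come from the SQL queries above)
apl_err = apl_err_remote.collect()
naive_err = naive_err_remote.collect()
comparison = pd.DataFrame({
    'Model': ['Naive (mean price)', 'APL Gradient Boosting', 'Random Forest'],
    'MAE': [naive_err['Mean Absolute Error'][0],
            apl_err['Mean Absolute Error'][0], round(rdf_mae, 4)],
    'RMSE': [naive_err['Root Mean Square Error'][0],
             apl_err['Root Mean Square Error'][0], round(rdf_rmse, 4)],
})
print(comparison.to_string(index=False))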

Classification Use Case

For building the APL classification model, we follow the same steps that we used in the regression case.

Step 1 – Define the HANA dataframes:

# Target and key columns of the Census dataset
target_col = 'class'
key_col = 'id'
# Sorted data for training
sql_cmd = 'SELECT * FROM "APL_SAMPLES"."CENSUS_TRAIN" ORDER BY 1'
train_remote = hd.DataFrame(conn, sql_cmd)
# Data for test without the known target
test_remote = conn.table('CENSUS_TEST', schema='APL_SAMPLES').drop([target_col])

Step 2 – APL model fit on the training data:

# Learning with APL
from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
apl_model = GradientBoostingBinaryClassifier()
apl_model.fit(train_remote, label=target_col, key=key_col)

Step 3 – Predict the target using the hold-out data:

apl_model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'AllProbabilities'})
apl_pred_remote = apl_model.predict(test_remote).rename_columns(['ID', 'PREDICTED', 'PROBA'])
# Persist the predictions so they can be joined with the actuals in SQL
apl_pred_remote.save('CENSUS_APL_PREDICTIONS', force=True)

Step 4 – Evaluate the accuracy of the predictions:

sql_cmd =  """
select A."id", A."class" as ACTUAL, P."PREDICTED", P."PROBA"
from "USER_APL"."CENSUS_APL_PREDICTIONS" P, "APL_SAMPLES"."CENSUS_TEST" A
where P."ID" = A."id"
"""
apl_pred_remote = hd.DataFrame(conn, sql_cmd)
apl_pred_df = apl_pred_remote.collect()

For classification accuracy metrics we will leverage scikit-learn functions:

from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

Here is a typical report to assess the accuracy of our APL classifier:

print(classification_report(
    y_true=(apl_pred_df.ACTUAL==1),
    y_pred=(apl_pred_df.PREDICTED==1)
))
[Figure: classification report of the APL model (CLS_APL_METRICS.png)]
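
As a reminder, the headline indicators in this report derive from the counts of true positives (TP), false positives (FP) and false negatives (FN):

\[
\text{precision} = \frac{TP}{TP+FP},
\qquad
\text{recall} = \frac{TP}{TP+FN},
\qquad
F_1 = 2\,\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\]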

One can draw the ROC curve and display the AUC value:

RocCurveDisplay.from_predictions(
    y_true=(apl_pred_df.ACTUAL==1),
    y_pred=(apl_pred_df.PROBA)
)
plt.title('APL Model')
plt.plot([0, 1], [0, 1], "k--", label="Random Guess")
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.grid()
plt.show()
[Figure: ROC curve of the APL model (CLS_APL_ROC.png)]
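
The AUC shown in the plot legend can also be computed directly from the same inputs:

# Numeric AUC from the predicted probabilities
from sklearn.metrics import roc_auc_score
print('AUC: %.4f' % roc_auc_score(apl_pred_df.ACTUAL == 1, apl_pred_df.PROBA))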

To conclude, we define a naïve classifier where the predicted class is the most frequent class:

sql_cmd =  """
With
NAIVE as (
select top 1 "class" as PREDICTED, count(*) as N
from "APL_SAMPLES"."CENSUS_TEST"
group by "class" order by 2 desc
)
select A."id", A."class" as ACTUAL, P."PREDICTED"
from NAIVE P, "APL_SAMPLES"."CENSUS_TEST" A
"""
naive_pred_remote = hd.DataFrame(conn, sql_cmd)
naive_pred_df = naive_pred_remote.collect()

We evaluate our naïve classifier and run the same report used earlier for APL:

print(classification_report(
    y_true=(naive_pred_df.ACTUAL==1),
    y_pred=(naive_pred_df.PREDICTED==1)
))
[Figure: classification report of the naïve classifier (CLS_NAIVE_METRICS.png)]

This naïve classifier always predicts class 0, the most frequent class in the training dataset:

df_remote = conn.table('CENSUS_TRAIN', schema='APL_SAMPLES')
df = df_remote.agg([('count', 'class', 'rows')], group_by='class').collect()
df['Train percent'] = (df['rows'] / df['rows'].sum()) * 100
df.style.hide(axis='index')

[Figure: class distribution in the training dataset (CLS_DISTRIB_TRAIN.png)]

The test dataset was built so that it preserves the class distribution:

df_remote = conn.table('CENSUS_TEST', schema='APL_SAMPLES')
df = df_remote.agg([('count', 'class', 'rows')], group_by='class').collect()
df['Test percent'] = (df['rows'] / df['rows'].sum()) * 100
df.style.hide(axis='index')

[Figure: class distribution in the test dataset (CLS_DISTRIB_TEST.png)]

Our naïve classifier makes a correct prediction 76% of the time.
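
That 76% is simply the share of the majority class, which you can confirm directly:

# Accuracy of the naive classifier equals the majority-class share
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f' % accuracy_score(naive_pred_df.ACTUAL == 1,
                                        naive_pred_df.PREDICTED == 1))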

This is the end of our blog series. For those who want to continue by building a random forest classifier on the Census data, be aware that you will first need to convert the non-numerical features (e.g., marital status) into numerical ones.
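
A minimal sketch of that conversion with pandas one-hot encoding, reusing the target and key columns defined above:

import pandas as pd
# One-hot encode the categorical Census features for scikit-learn
df_train = conn.table('CENSUS_TRAIN', schema='APL_SAMPLES').collect()
x_train = pd.get_dummies(df_train.drop(columns=[target_col, key_col]))
y_train = df_train[target_col]

Keep in mind that the training and hold-out sets must end up with the same dummy columns (for instance by aligning them with DataFrame.reindex), otherwise the fitted model will reject the hold-out features.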

To know more about APL

