Building SciKitLearn Random Forest Model and Tuning Parameters without writing Python Code

Random Forest is a supervised learning algorithm that can be used for classification and regression. In this article we go through the process of training a Random Forest model, including automatic parameter tuning, without writing any Python code. We will use patient medical data to predict heart disease as an example use case.

The implementation is available in the open source project avenir on GitHub. Extensive use of configuration parameters enables the end user to use the solution without writing Python code.

Random Forest

Random Forest consists of an ensemble of Decision Trees. Ensemble based models reduce error due to variance while keeping the error due to bias essentially unchanged. Tree ensembles can be built in the following ways.

  • Each tree uses a random subset of the training data set
  • At each split point, a random subset of the features is considered

Random Forest combines both approaches: each tree is typically trained on a bootstrap sample of the data, and at each split point only a random subset of the features is considered. For classification, the final prediction is based on majority votes by the members. For regression, the average or median of the predictions from the ensemble members is taken.
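
As a quick illustration of how these ideas surface in ScikitLearn (the same estimator the avenir wrapper configures through the properties file), here is a minimal sketch. The parameter values are illustrative, not the tutorial's settings.

from sklearn.ensemble import RandomForestClassifier

# n_estimators: number of trees in the ensemble
# max_features: size of the random feature subset considered at each split
# bootstrap: whether each tree is trained on a bootstrap sample of the data
model = RandomForestClassifier(n_estimators=150, max_features="sqrt",
                               bootstrap=True, random_state=100)
# after model.fit(X, y), model.predict(...) takes a majority vote across the trees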

Heart Disease Data

The data set is generated artificially using ancestral sampling. The data set has the following fields

  • Patient ID (not used)
  • Sex
  • Age
  • Systolic blood pressure
  • Diastolic blood pressure
  • Smoker (boolean)
  • Diet (categorical with 3 values)
  • Physical activity per week (num of hours)
  • Education (num of years)
  • Ethnicity (categorical with 4 values)
  • Has heart disease (binary target variable)

There are 9 feature variables. After One Hot Binary Encoding of the categorical variables we end up with 18 feature variables. For a real life model, a lot more medical and personal data would be included.
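
For illustration, the one hot encoding of the categorical fields could be done with pandas as sketched below. The column names here are hypothetical, and a comma separated file with no header row is assumed.

import pandas as pd

# hypothetical column names for the fields listed above
cols = ["pid", "sex", "age", "sys_bp", "dia_bp", "smoker", "diet",
        "activity_hrs", "edu_yrs", "ethnicity", "heart_disease"]
df = pd.read_csv("hd_5000.txt", names=cols)

# diet (3 values) and ethnicity (4 values) expand into multiple binary columns
df = pd.get_dummies(df, columns=["sex", "diet", "ethnicity"])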

Random Forest Model Building

To train and validate a model, edit the configuration as follows, setting the right mode and the boolean parameter so that the trained model is not saved

common.mode=trainValidate
train.model.save=False

Next, run the following command to train and perform k fold cross validation

./rfd.py rfd.properties

Here is the output. It reports the error rate at the end.

running mode: trainValidate
...building random forest model
...training and kfold cross validating model
average error with k fold cross validation 00.048
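
Under the hood, trainValidate mode amounts to k fold cross validation of the configured estimator. Here is a rough sketch of the equivalent direct ScikitLearn calls, with synthetic stand-in data in place of hd_5000.txt.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# stand-in data; in the tutorial the features come from hd_5000.txt
X, y = make_classification(n_samples=5000, n_features=18, random_state=100)
model = RandomForestClassifier(random_state=100)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # train.num.folds=5
print("average error with k fold cross validation", 1.0 - scores.mean())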

Code Free Configuration Driven Usage

Code free usage is made possible by two essential artifacts of the solution. First, there is a Python wrapper class around the ScikitLearn Random Forest implementation. Second, there is a configuration properties file where all the configuration parameters are defined.
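
A properties file of this kind is easy to consume from Python. The avenir loader may differ; this is just a minimal sketch of the idea.

def load_props(path):
    # parse key=value lines, skipping blanks and comments
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
    return props

props = load_props("rfd.properties")
print(props["common.mode"])  # e.g. trainValidate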

The configuration parameters belong to the following categories

  • Related to data, e.g. defining which columns are features
  • Related to the model, e.g. directory and file name for the saved model
  • Related to the Random Forest algorithm

From a functional perspective, the parameters are divided into three groups: train, validate and predict. The parameter names are prefixed with those words accordingly. Here is an example properties file.

common.mode=trainValidate
common.model.directory=model
common.model.file=hd_rf_model
common.preprocessing=scale
common.verbose=_
train.data.file=hd_5000.txt
train.data.fields=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
train.data.feature.fields=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
train.data.class.field=18
train.validation=kfold
train.num.folds=5
train.num.trees=_
train.split.criterion=_
train.max.depth=_
train.min.samples.split=_
train.min.samples.leaf=_
train.min.weight.fraction.leaf=_
train.max.features=_
train.max.leaf.nodes=_
train.min.impurity.decrease=_
train.min.impurity.split=_
train.bootstrap=_
train.oob.score=_
train.num.jobs=_
train.random.state=100
train.verbose=_
train.warm.start=_
train.success.criterion=error
train.model.save=False
train.score.method=accuracy
train.search.param.strategy=_
train.search.params=train.search.learning.rate:float,train.search.num.estimators:int
train.search.learning.rate=0.14,0.16
train.search.num.estimators=140,160
train.auto.max.test.error=0.06
train.auto.max.error=0.08
train.auto.max.error.diff=0.02
predict.data.file=hd_100.txt
predict.data.fields=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
predict.data.feature.fields=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
predict.use.saved.model=True
validate.data.file=hd_100.txt
validate.data.fields=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
validate.data.feature.fields=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
validate.data.class.field=17
validate.use.saved.model=False
validate.score.method=confusionMatrix

Many of the parameters are self explanatory. I will also skip the Random Forest related parameters. You can consult the ScikitLearn documentation for them. I will go over the remaining parameters.

  • common.mode – Defines the different modes of execution. Options are train, validate, trainValidate, trainValidateSearch, auto and predict
  • train.data.fields – Columns in the data set used for training the model
  • train.data.feature.fields – Columns that are features from the extracted columns
  • train.data.class.field – Column that is the class or target among the extracted columns
  • train.success.criterion – Whether to use the performance metric or its inverse
  • train.model.save – If True, the model is saved after training
  • train.score.method – Classification performance metric in train and validate mode
  • train.search.param.strategy – Parameter search strategy, e.g. grid, random and simulated annealing
  • train.search.params – List of parameters used for search
  • train.auto.max.test.error – Maximum test error for auto mode
  • train.auto.max.error – Maximum error used for bias error checking in auto mode
  • train.auto.max.error.diff – Maximum test and train error difference used for generalization error checking in auto mode
  • predict.data.fields – Similar to train.data.fields
  • predict.data.feature.fields – Similar to train.data.feature.fields
  • predict.use.saved.model – If True uses saved model in predict mode
  • validate.data.fields – Similar to train.data.fields
  • validate.data.feature.fields – Similar to train.data.feature.fields
  • validate.data.class.field – Similar to train.data.class.field
  • validate.use.saved.model – If True uses saved model in validate mode
  • validate.score.method – Classification performance metric in validate mode

The parameters train.validation through train.warm.start are essentially ScikitLearn Random Forest parameters. Please refer to the ScikitLearn documentation for details of the algorithm parameters. The value _ for any parameter implies that the default value for that parameter will be used.
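
One plausible way to honor the _ convention is to omit such parameters from the keyword arguments, so that ScikitLearn's own defaults apply. The mapping below is a hypothetical sketch, not avenir's actual code.

from sklearn.ensemble import RandomForestClassifier

def algo_kwargs(props, mapping):
    # keep only parameters whose value is not the "_" placeholder
    kwargs = {}
    for prop_name, (kw_name, cast) in mapping.items():
        value = props.get(prop_name, "_")
        if value != "_":
            kwargs[kw_name] = cast(value)
    return kwargs

props = {"train.num.trees": "_", "train.max.depth": "_", "train.random.state": "100"}
mapping = {"train.num.trees": ("n_estimators", int),
           "train.max.depth": ("max_depth", int),
           "train.random.state": ("random_state", int)}
model = RandomForestClassifier(**algo_kwargs(props, mapping))  # defaults except random_state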

The parameters train.search.learning.rate and train.search.num.estimators correspond to the search parameter names listed in train.search.params. These parameter names are the same as the normal train parameter names, except that the word search is inserted after train. These parameters specify a range of values for numeric parameters and a set of values for categorical parameters.
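
For a grid strategy, the configured end points might be expanded into a grid of candidate combinations along the following lines. This is a sketch using the example values above, not the avenir implementation.

from itertools import product

search_space = {
    "train.search.learning.rate": [0.14, 0.16],
    "train.search.num.estimators": [140, 160],
}
names = list(search_space)
for values in product(*search_space.values()):
    candidate = dict(zip(names, values))
    # train and validate a model for each candidate, keeping the best test error
    print(candidate)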

Test Validation and More

The execution mode, defined with the parameter common.mode, may have any of the following values

  • train – Performs training only. Reports training error
  • validate – Performs validation for an already trained model. Reports test error
  • trainValidate – Performs train and validate both. Reports test error
  • trainValidateSearch – Performs train and validate while searching through parameter space. Reports minimum test error
  • auto – Performs the same function as the trainValidateSearch mode. In addition, trains the final model based on the optimum set of parameter values. Reports the training error of the final model
  • predict – Performs prediction with a pre-trained model.

Typically you will save only the final trained model after parameters are tuned. There are various ways you can tune parameters. We will discuss that next.

The mode auto is a one stop solution. After tuning parameters, it trains the final model based on the final optimum parameter value set. There are two options for checking the final result

  • Based on test error only
  • Based on both the difference and the average of test and train errors

The average of test and train error reflects error due to bias and the difference reflects error due to variance.
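
Expressed in code, the two checks might look like the sketch below, using the threshold values from the example configuration.

# thresholds from the example configuration
MAX_TEST_ERROR = 0.06   # train.auto.max.test.error
MAX_ERROR = 0.08        # train.auto.max.error
MAX_ERROR_DIFF = 0.02   # train.auto.max.error.diff

def result_acceptable(train_error, test_error):
    bias_ok = (train_error + test_error) / 2 <= MAX_ERROR          # bias check
    variance_ok = abs(test_error - train_error) <= MAX_ERROR_DIFF  # generalization check
    return test_error <= MAX_TEST_ERROR and bias_ok and variance_ok

# e.g. a hypothetical 0.03 train error with the 0.048 test error reported earlier
print(result_acceptable(0.03, 0.048))  # True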

There is no guarantee that in the auto mode the best possible model will be found, as indicated by the checks on the result listed above. Such a scenario is possible for many reasons, e.g.

  • Incorrect parameter selection for search
  • Incorrect value range selection for parameters
  • Misconfigured optimizer

Hyper Parameter Search

There are three ways to search the hyper parameter space, listed below. The first is manual brute force and not practical.

  • Run repeatedly in trainValidate mode, changing different RandomForest parameter values.
  • Run in trainValidateSearch mode, after setting the search algorithm of your choice, selecting the parameters for search and value range for each parameter.
  • Run in trainValidate mode, but use the Hyperopt Python optimizer. Hyperopt uses Bayesian Optimization

As I alluded to earlier, the first option is manual and not practical unless you happen to be lucky and hit the sweet spot in the parameter space after a few trials. The second option supports 3 parameter search algorithms as implemented in avenir. The three choices are grid, random and simulated annealing. With the third option, Bayesian Optimization as implemented in the Hyperopt Python library is used for parameter tuning.
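
For the third option, a minimal Hyperopt sketch looks like this, with synthetic stand-in data and a hypothetical search space.

from hyperopt import fmin, tpe, hp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=18, random_state=100)

def objective(params):
    # Hyperopt minimizes the returned value, so return the test error
    model = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                   max_depth=int(params["max_depth"]),
                                   random_state=100)
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

space = {"n_estimators": hp.quniform("n_estimators", 100, 200, 10),
         "max_depth": hp.quniform("max_depth", 3, 12, 1)}
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25)
print(best)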

As parameter values are changed during the search, the model complexity or capacity also changes, which gets reflected in the test error. Test error initially decreases with model complexity and then increases again beyond a certain level of complexity. In parameter search and tuning we try to capture that minimum point.

How do you know whether you have used enough training data for training the model? According to computational learning theory, a model of given complexity requires some minimum amount of training data. The only way to find out is to check whether the test error drops with additional training data.
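
ScikitLearn's learning_curve utility makes this check straightforward. Here is a sketch with stand-in data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=18, random_state=100)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=100), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5))
# if test error is still falling at the largest size, more data should help
for n, err in zip(sizes, 1.0 - test_scores.mean(axis=1)):
    print(n, round(err, 4))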

Wrapping Up

In this post we have seen how a Random Forest classification model can be trained, with parameter tuning, without writing any Python code. If you want to try it out, please follow the steps in the tutorial. To use the solution for another problem, all you have to do is create an appropriate configuration file.

