Synthetic Regression Data Generation in Python
source link: https://pkghosh.wordpress.com/2023/01/22/synthetic-regression-data-generation-in-python/
In one of my projects, I needed to generate synthetic data for a regression model. After looking around, I could not find anything satisfactory, including in Scikit-learn, and I wanted more control over the data generation process. So I decided to implement my own data generator. It provides a lot of control over the generation process, all configured through a configuration file. I hope you will find it useful. It's available in the Python package matumizi. For more details, please refer to the GitHub repo whakapai.
Synthetic Data
Although ideally real data should be used for training any Machine Learning model, such data is often not available. The only option then is to generate synthetic data programmatically. After searching for existing solutions, I found Scikit-learn to be the best, but still deficient in many ways. Here is a list of features provided in my implementation, most of which are not available elsewhere, including in Scikit-learn. They are all specified in a configuration file, except for the user defined function.
- The distributions to sample from for each predictor variable. Currently 5 different statistical distributions are supported
- Min and max values for each predictor variable. They are used to range limit the sampled values and are also used internally for min max scaling
- Relative weights for linear terms for each predictor variable.
- Relative weights for square terms for each predictor variable.
- Relative weights for cross terms for each predictor variable pair.
- Parameters for correlated variables for each predictor variable pair.
- Bias value
- Noise (Uniform or Gaussian)
- User defined function for implementing any complex math function with the predictor variables
- Min and max values for the target variable. It’s used to scale the weights to make sure the target variable values fall within a specified range.
For relative weights, you can use any value as the baseline, e.g. 1.0. The weights get scaled based on the specified range of the target variable, while the relative values of the weights are maintained. If you want to include irrelevant variables, set their weights to zero or close to zero.
Data Generation
The implementation is in the Python class RegressionDataGenerator in the module mlutil. The following steps are performed in the constructor
- Sample predictor variable values and calculate target variable values.
- Take the mean of the target variable values.
- Using the mean of the sampled target variable values and the desired target variable range, scale all linear, square and cross term weights to ensure that the target variable values fall within the desired range. The relative values of the weights are maintained
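The weight scaling step above can be sketched as follows. This is a minimal illustration of the idea, not the actual matumizi code; the function name scale_weights and the choice of the range midpoint as the desired target mean are my assumptions.

```python
import statistics

def scale_weights(lin_w, sq_w, cr_w, raw_targets, tmin, tmax):
    """Scale all weights by one common factor so that the mean of the raw
    (unscaled) target samples maps to the middle of [tmin, tmax]. Using a
    single factor preserves the relative values of the weights."""
    raw_mean = statistics.mean(raw_targets)
    desired_mean = (tmin + tmax) / 2.0
    factor = desired_mean / raw_mean
    return ([w * factor for w in lin_w],
            [w * factor for w in sq_w],
            [w * factor for w in cr_w])

# raw targets with mean 3.0, desired target range 50 to 300 (midpoint 175)
lin, sq, cr = scale_weights([1.2, 1.4, 1.0], [0.15], [0.1],
                            raw_targets=[2.0, 3.0, 4.0], tmin=50, tmax=300)
```

Note that because every weight is multiplied by the same factor, the ratio between any two weights is unchanged after scaling.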
After the object is constructed, you can call the sample() method once for each sample to be generated. It returns the predictor variable values and the target variable value. The steps in the sample() method are as follows
- Sample predictor variable values using the provided distributions
- Make correction for correlated variables. The correlation source variable value is used to compute the correlated variable value. The correlation involves a bias, a linear coefficient and the std deviation of zero mean Gaussian noise
- Range limit the predictor variable values
- Do min max scaling of the predictor variable values
- Compute target variable value using linear, square and cross term coefficients.
- Add noise to target variable value
- Call user callback function if provided and add the returned value to the target variable value
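The correlated variable correction in the second step amounts to a simple linear relationship plus noise. Here is a sketch using the example parameter values that appear later in the configuration discussion; the function and parameter names are mine, not matumizi's.

```python
import random

def correlated_value(src_value, bias, coeff, noise_sd):
    """Correlated variable value = bias + linear coefficient * source value
    + zero mean Gaussian noise with the given std deviation."""
    return bias + coeff * src_value + random.gauss(0.0, noise_sd)

# bias 40.0 and coefficient 30.0; noise suppressed for a deterministic result
v = correlated_value(2.0, 40.0, 30.0, 0.0)
```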
Min max scaling of the predictor variables is performed so that they are all within the same range. Once the variable values are scaled to the same range, the only influencers on the target variable value are the different weights. All the weights are scaled based on the mean of min max scaled predictor variable values and the desired mean of the target variable value.
To include irrelevant predictor variables, you could make the weight zero or some small number for that variable.
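Putting the steps together, the core of sample() can be sketched like this. It is a simplified stand-in, not the matumizi implementation: the correlated variable correction is omitted, samplers are plain callables, and all names are assumptions.

```python
import random

def sample_once(samplers, ranges, lin_w, sq_w, cr_w, bias, noise_sd, callback=None):
    """One draw: sample predictors, range limit, min max scale, then combine
    linear, square and cross terms plus noise into the target value."""
    # 1. sample each predictor from its distribution
    x = [s() for s in samplers]
    # 2. range limit, then min max scale to [0, 1]
    z = []
    for v, (lo, hi) in zip(x, ranges):
        v = min(max(v, lo), hi)
        z.append((v - lo) / (hi - lo))
    # 3. bias plus linear terms
    y = bias + sum(w * v for w, v in zip(lin_w, z))
    # 4. square terms, given as (index, weight) pairs
    for i, w in sq_w:
        y += w * z[i] * z[i]
    # 5. cross terms, given as (index1, index2, weight) triplets
    for i, j, w in cr_w:
        y += w * z[i] * z[j]
    # 6. zero mean Gaussian noise on the target
    y += random.gauss(0.0, noise_sd)
    # 7. optional user callback contributes to the target
    if callback is not None:
        y += callback(z)
    return x, y

# fixed draws and zero noise for a deterministic illustration
samplers = [lambda: 2, lambda: 100.0]
ranges = [(1, 3), (30, 200)]
x, y = sample_once(samplers, ranges, lin_w=[2.0, 0.0], sq_w=[(0, 4.0)],
                   cr_w=[], bias=10.0, noise_sd=0.0)
```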
Example
The example is from eCommerce: a regression model to predict how much a customer will spend in the next transaction. The predictor variables are as follows. The target variable is the amount spent in the next transaction
- Income group
- Average transaction amount in last one year
- No of transactions in last one year
- Day of the week for a transaction
- Whether marketing campaign email was sent prior to the transaction
All the configurations for this are specified in a configuration file. You can use it as an example. Let's dissect it next
common.pvar.samplers=1:3:1:30:50:20:discrete:int,100:20:normal:float,1:8:1:10:20:50:70:85:100:60:30:discrete:int,1:7:1:60:40:30:50:70:95:120:discrete:int,0.5:0:1:bernauli:int

Sampler for each predictor variable, separated by commas. Each colon separated string is the sampler definition for one predictor variable. For example, in 1:3:1:30:50:20:discrete:int, the sampler type is discrete and the sampler output is int. The beginning value is 1, the end value is 3 and the step size is 1. The probability distribution values are 30, 50 and 20 for the 3 discrete values.

common.pvar.ranges=1,3,30,200,1,8,1,7,0,1

The min and max values for each predictor variable, two values per variable.

common.linear.weights=1.2,1.4,1.0,1.2,1.5

Linear weights for the predictor variables.

common.square.weights=1,0.15

Square weights. In each pair, the first number is the index of the predictor variable and the second number is the weight.

common.crterm.weights=2,3,0.1

Cross term weights. In each triplet, the first and second numbers are the indexes of the predictor variables involved and the third number is the weight.

common.corr.params=0:1:40.0:30.0:.08:false

The first number is the index of the correlation source variable and the second number is the index of the correlated variable. The third number is the bias, the fourth number is the linear coefficient and the fifth number is the std deviation of a zero mean Gaussian distribution for noise. For additional correlation pairs, more colon separated strings can be provided, separated by commas.

common.bias=20

Bias value.

common.noise=normal,.05

The first item is the type of distribution for noise, either uniform or gaussian. The second item is a number which is either the half range for the uniform distribution or the std deviation for the zero mean Gaussian distribution.

common.tvar.range=50,300

Min and max values for the target variable.

common.weight.niter=200

No of samples used for scaling the weights.
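To make the sampler format concrete, here is how a discrete sampler spec such as 1:3:1:30:50:20:discrete:int could be parsed and sampled. This is an illustrative sketch, not matumizi's actual parser.

```python
import random

def parse_discrete(spec):
    """Parse 'beg:end:step:p1:p2:...:discrete:int' into values and weights."""
    parts = spec.split(":")
    beg, end, step = int(parts[0]), int(parts[1]), int(parts[2])
    values = list(range(beg, end + 1, step))
    # one relative probability per discrete value
    weights = [float(p) for p in parts[3:3 + len(values)]]
    return values, weights

values, weights = parse_discrete("1:3:1:30:50:20:discrete:int")
# draw one value with the given relative probabilities
v = random.choices(values, weights=weights, k=1)[0]
```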
The only item not specified in the configuration is the callback function. When called, the callback function is provided with all the predictor variable values as a list argument. The function performs some user defined computation and returns the result, which contributes toward the final target value. Here is some sample output. Driver code is available.
3,116.17,3,7,0,123.66
2,105.83,1,6,1,129.32
1,63.51,4,1,1,87.05
2,101.71,3,2,1,119.81
2,84.50,6,7,0,113.05
2,86.69,1,1,0,54.83
2,97.17,4,6,0,100.29
1,63.82,6,7,0,85.06
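The user defined callback described above could be as simple as the following. The function name and the particular nonlinear term are purely hypothetical, chosen only to show the contract: take the list of predictor values, return a number.

```python
import math

def spend_callback(pvals):
    """Hypothetical user defined term: a nonlinear interaction between the
    average transaction amount (index 1) and the transaction count (index 2)."""
    return 5.0 * math.log(1.0 + pvals[1]) * pvals[2]

# predictor values in the order used by the example above
contrib = spend_callback([2, 100.0, 3, 7, 0])
```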
This is how I ran the generator using the Python driver module
python3 mcamp.py --op gen --genconf mcamp.properties --nsamp 2000
Summing Up
We have gone through a solution for data generation for a regression problem. It provides many features not available elsewhere. All that is required is a configuration file, similar to the one provided as an example.