Synthetic Regression Data Generation in Python
source link: https://pkghosh.wordpress.com/2023/01/22/synthetic-regression-data-generation-in-python/
In one of my projects, I needed to generate synthetic data for a regression model. After looking around, I could not find anything satisfactory, including in Scikit-learn, and I wanted more control over the data generation process. So I decided to implement my own data generator. It provides a lot of control over the generation process, all configured through a configuration file. I hope you will find it useful. It's available in the Python package matumizi. For more details, please refer to the GitHub repo whakapai.
Synthetic Data
Although ideally real data should be used for training any Machine Learning model, such data is often not available. The only option then is to generate synthetic data programmatically. After searching for existing solutions, I found Scikit-learn to be the best, but still deficient in many ways. Here is a list of features provided in my implementation, most of which are not available elsewhere, including in Scikit-learn. They are all specified in a configuration file, except for the user defined function.
- The distributions to sample from for each predictor variable. Currently 5 different statistical distributions are supported
- Min and max values for each predictor variable. They are used to range limit the sampled values and are also used internally for min max scaling
- Relative weights for linear terms for each predictor variable.
- Relative weights for square terms for each predictor variable.
- Relative weights for cross terms for each predictor variable pair.
- Parameters for correlated variables for each predictor variable pair.
- Bias value
- Noise (Uniform or Gaussian)
- User defined function for implementing any complex math function with the predictor variables
- Min and max values for the target variable. It’s used to scale the weights to make sure the target variable values fall within a specified range.
For relative weights, you can use any value as the baseline, e.g. 1.0. The weights get scaled based on the specified range of the target variable, while the relative values of the weights are maintained. If you want to include irrelevant variables, set their weights to zero or close to zero.
Data Generation
The implementation is in the Python class RegressionDataGenerator in the module mlutil. The following steps are performed in the constructor
- Sample predictor variable values and calculate target variable values.
- Take the mean of the target variable values.
- Using the mean of the sampled target variable values and the desired target variable range, scale all linear, square and cross term weights to ensure that the target variable values fall within the desired range. The relative values of the weights are maintained
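The weight scaling step above can be sketched as follows. This is a minimal illustration of the idea, not the actual matumizi code; the function name scale_weights and the choice of the range midpoint as the desired target mean are my assumptions.

```python
import statistics

def scale_weights(lin_w, sq_w, cr_w, raw_targets, tmin, tmax):
    """Scale all weights by one common factor so that the mean of the raw
    (unscaled) target samples maps to the middle of [tmin, tmax]. Using a
    single factor preserves the relative values of the weights."""
    raw_mean = statistics.mean(raw_targets)
    desired_mean = (tmin + tmax) / 2.0
    factor = desired_mean / raw_mean
    return ([w * factor for w in lin_w],
            [w * factor for w in sq_w],
            [w * factor for w in cr_w])

# raw targets with mean 3.0, desired target range 50 to 300 (midpoint 175)
lin, sq, cr = scale_weights([1.2, 1.4, 1.0], [0.15], [0.1],
                            raw_targets=[2.0, 3.0, 4.0], tmin=50, tmax=300)
```

Note that because every weight is multiplied by the same factor, the ratio between any two weights is unchanged after scaling.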
After the object is constructed, you can call the sample() method once for each sample to be generated. It returns the predictor variable values and the target variable value. The steps in the sample() method are as follows
- Sample predictor variable values using the provided distributions
- Make correction for correlated variables. The correlation source variable value is used to compute the correlated variable value. The correlation involves a bias, a linear coefficient and the std deviation of zero mean Gaussian noise
- Range limit the predictor variable values
- Do min max scaling of the predictor variable values
- Compute target variable value using linear, square and cross term coefficients.
- Add noise to target variable value
- Call user callback function if provided and add the returned value to the target variable value
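The correlated variable correction in the second step amounts to a simple linear relationship plus noise. Here is a sketch using the example parameter values that appear later in the configuration discussion; the function and parameter names are mine, not matumizi's.

```python
import random

def correlated_value(src_value, bias, coeff, noise_sd):
    """Correlated variable value = bias + linear coefficient * source value
    + zero mean Gaussian noise with the given std deviation."""
    return bias + coeff * src_value + random.gauss(0.0, noise_sd)

# bias 40.0 and coefficient 30.0; noise suppressed for a deterministic result
v = correlated_value(2.0, 40.0, 30.0, 0.0)
```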
Min max scaling of the predictor variables is performed so that they are all within the same range. Once the variable values are scaled to the same range, the only influencers on the target variable value are the different weights. All the weights are scaled based on the mean of min max scaled predictor variable values and the desired mean of the target variable value.
To include irrelevant predictor variables, you could make the weight zero or some small number for that variable.
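Putting the steps together, the core of sample() can be sketched like this. It is a simplified stand-in, not the matumizi implementation: the correlated variable correction is omitted, samplers are plain callables, and all names are assumptions.

```python
import random

def sample_once(samplers, ranges, lin_w, sq_w, cr_w, bias, noise_sd, callback=None):
    """One draw: sample predictors, range limit, min max scale, then combine
    linear, square and cross terms plus noise into the target value."""
    # 1. sample each predictor from its distribution
    x = [s() for s in samplers]
    # 2. range limit, then min max scale to [0, 1]
    z = []
    for v, (lo, hi) in zip(x, ranges):
        v = min(max(v, lo), hi)
        z.append((v - lo) / (hi - lo))
    # 3. bias plus linear terms
    y = bias + sum(w * v for w, v in zip(lin_w, z))
    # 4. square terms, given as (index, weight) pairs
    for i, w in sq_w:
        y += w * z[i] * z[i]
    # 5. cross terms, given as (index1, index2, weight) triplets
    for i, j, w in cr_w:
        y += w * z[i] * z[j]
    # 6. zero mean Gaussian noise on the target
    y += random.gauss(0.0, noise_sd)
    # 7. optional user callback contributes to the target
    if callback is not None:
        y += callback(z)
    return x, y

# fixed draws and zero noise for a deterministic illustration
samplers = [lambda: 2, lambda: 100.0]
ranges = [(1, 3), (30, 200)]
x, y = sample_once(samplers, ranges, lin_w=[2.0, 0.0], sq_w=[(0, 4.0)],
                   cr_w=[], bias=10.0, noise_sd=0.0)
```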
Example
The example is from eCommerce: a regression model to predict how much a customer will spend in the next transaction. The predictor variables are as follows. The target variable is the amount spent in the next transaction
- Income group
- Average transaction amount in last one year
- No of transactions in last one year
- Day of the week for a transaction
- Whether marketing campaign email was sent prior to the transaction
All the configurations for this are specified in a configuration file. You can use it as an example. Let's dissect it next
common.pvar.samplers=1:3:1:30:50:20:discrete:int,100:20:normal:float,1:8:1:10:20:50:70:85:100:60:30:discrete:int,1:7:1:60:40:30:50:70:95:120:discrete:int,0.5:0:1:bernauli:int

Sampler for each predictor variable, separated by commas. Each colon separated string is the sampler definition for one predictor variable. For example, in 1:3:1:30:50:20:discrete:int, the sampler type is discrete and the sampler output is int. The beginning value is 1, the end value is 3 and the step size is 1. The probability distribution values are 30, 50 and 20 for the 3 discrete values.

common.pvar.ranges=1,3,30,200,1,8,1,7,0,1

The min and max values for each predictor variable, two values per variable.

common.linear.weights=1.2,1.4,1.0,1.2,1.5

Linear weights for the predictor variables.

common.square.weights=1,0.15

Square weights. In each pair, the first number is the index of the predictor variable and the second number is the weight.

common.crterm.weights=2,3,0.1

Cross term weights. In each triplet, the first and second numbers are the indexes of the predictor variables involved and the third number is the weight.

common.corr.params=0:1:40.0:30.0:.08:false

The first number is the index of the correlation source variable and the second number is the index of the correlated variable. The third number is the bias, the fourth number is the linear coefficient and the fifth number is the std deviation of a zero mean Gaussian distribution for noise. For additional correlation pairs, more colon separated strings can be provided, separated by commas.

common.bias=20

Bias value.

common.noise=normal,.05

The first item is the type of distribution for noise, either uniform or gaussian. The second item is a number which is either the half range for the uniform distribution or the std deviation for the zero mean Gaussian distribution.

common.tvar.range=50,300

Min and max values for the target variable.

common.weight.niter=200

No of samples used for scaling the weights.
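To make the sampler format concrete, here is how a discrete sampler spec such as 1:3:1:30:50:20:discrete:int could be parsed and sampled. This is an illustrative sketch, not matumizi's actual parser.

```python
import random

def parse_discrete(spec):
    """Parse 'beg:end:step:p1:p2:...:discrete:int' into values and weights."""
    parts = spec.split(":")
    beg, end, step = int(parts[0]), int(parts[1]), int(parts[2])
    values = list(range(beg, end + 1, step))
    # one relative probability per discrete value
    weights = [float(p) for p in parts[3:3 + len(values)]]
    return values, weights

values, weights = parse_discrete("1:3:1:30:50:20:discrete:int")
# draw one value with the given relative probabilities
v = random.choices(values, weights=weights, k=1)[0]
```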
The only item not specified in the configuration is the callback function. When called, the callback function is provided with all the predictor variable values as a list argument. The function performs some user defined computation and returns the result, which contributes toward the final target value. Here is some sample output. Driver code is available.
3,116.17,3,7,0,123.66
2,105.83,1,6,1,129.32
1,63.51,4,1,1,87.05
2,101.71,3,2,1,119.81
2,84.50,6,7,0,113.05
2,86.69,1,1,0,54.83
2,97.17,4,6,0,100.29
1,63.82,6,7,0,85.06
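The user defined callback described above could be as simple as the following. The function name and the particular nonlinear term are purely hypothetical, chosen only to show the contract: take the list of predictor values, return a number.

```python
import math

def spend_callback(pvals):
    """Hypothetical user defined term: a nonlinear interaction between the
    average transaction amount (index 1) and the transaction count (index 2)."""
    return 5.0 * math.log(1.0 + pvals[1]) * pvals[2]

# predictor values in the order used by the example above
contrib = spend_callback([2, 100.0, 3, 7, 0])
```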
This is how I ran the generator using the Python driver module
python3 mcamp.py --op gen --genconf mcamp.properties --nsamp 2000
Summing Up
We have gone through a solution for data generation for a regression problem. It provides many features not available elsewhere. All that is required is a configuration file, similar to the one provided as an example.