source link: https://pkghosh.wordpress.com/2020/07/13/learn-about-your-data-with-about-seventy-data-exploration-functions-all-in-one-python-class/

Learn about Your Data with about Seventy Data Exploration Functions All in One Python Class

It’s a costly mistake to jump straight into building machine learning models before getting good insight into your data. I have made that mistake and paid the price. Since then, I have made it a rule to learn as much as possible about the data before taking the next step. While exploring data, I always found myself using multiple Python libraries and doing a plethora of imports for various Python modules.

That experience motivated me to consolidate all the common Python data exploration functions into one Python class to make them easier to use. As an added feature, I have also provided a workspace-like interface with which you can register multiple data sets, each with a user-provided name. You can then refer to the data sets by name and perform various operations on them. The Python implementation is available in my open source project avenir on GitHub.

API Usage

Most of the data exploration functions are implemented on top of existing Python libraries; only a few are implemented from scratch. The following Python libraries are used.

  1. numpy
  2. scipy
  3. pandas
  4. statsmodels
  5. scikit-learn

There are two kinds of API: most of the functions are for data exploration and the rest are for workspace management. The API usage for data exploration has the following pattern (a minimal sketch follows the list).

  1. Create an instance of DataExplorer.
  2. Register multiple data sets. A data set is essentially a 1D array with a name. A data source can be a file, a pandas data frame, a numpy array or a list. For example, you can pass a CSV file, specifying the columns you want to use; for each column it will register a data set.
  3. Call any of the 66 data exploration functions available. Names of one or more data sets are always passed as arguments.
  4. The result is always returned as a Python dictionary.
  5. By default, there is always console output. It can be disabled by setting the argument verbose to False in the constructor.
  6. You can add notes for any registered data set as you explore it.
  7. The whole workspace can be saved and restored, if you want to continue your exploration session later. The workspace consists of a dictionary holding all the data sets and a dictionary holding the metadata for all data sets.
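Here is a minimal sketch of that pattern, registering a plain Python list and computing summary statistics. The module name daexp comes from the examples later in this post; the sample data and the data set name "demo" are made up for illustration.

import random
from daexp import DataExplorer

# create the explorer; verbose=False suppresses the default console output
exp = DataExplorer(verbose=False)

# register a plain Python list as a named numeric data set
values = [random.gauss(100.0, 15.0) for _ in range(500)]
exp.addListNumericData(values, "demo")

# exploration calls take registered names and return a Python dictionary
stats = exp.getStats("demo")
print(stats["mean"], stats["std"])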

The source code has comments on the input arguments of each function. For further details, it is best to refer to the documentation of the underlying Python library used by any particular function.

Workspace Management API

Here are the functions for workspace management. Through these, you can load data from various sources and save and restore the workspace. You need to use them to register data sets before you can operate on them, although the data exploration API also accepts any unregistered list or numpy array. A short sketch of a typical workspace session follows the table.

Function : Comment
save(filePath) : save workspace
restore(filePath) : restore workspace
queryFileData(filePath, *columns) : query data types for file columns
queryDataFrameData(df, *columns) : query data types for data frame columns
getDataType(col) : query data type for a data set (numeric, binary, categorical)
addFileNumericData(filePath, *columns) : add numeric columns from file
addFileBinaryData(filePath, *columns) : add binary columns from file
addDataFrameNumericData(filePath, *columns) : add numeric columns from data frame
addDataFrameBinaryData(filePath, *columns) : add binary columns from data frame
addListNumericData(ds, name) : add numeric data from list
addListBinaryData(ds, name) : add binary data from list
addFileCatData(filePath, *columns) : add categorical columns from file
addDataFrameCatData(df, *columns) : add categorical columns from data frame
addCatListData(ds, name) : add categorical data from list
remData(ds) : remove data set
addNote(ds, note) : add note for a data set
getNotes(ds) : get notes for a data set
getNumericData(ds) : get numeric data for a data set
getCatData(ds) : get categorical data for a data set
showNames() : get list of names of registered data sets
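The sketch below shows a typical workspace session: register columns from a file, attach a note, save the workspace, and restore it later. The file and column names are taken from the demand example later in this post; the workspace file name explore.mod is a placeholder.

from daexp import DataExplorer

exp = DataExplorer()
# register two numeric columns from a file, as in the demand example below
exp.addFileNumericData("bord.txt", 0, 1, "pdemand", "demand")
exp.addNote("pdemand", "projected demand; roughly symmetric, no obvious outliers")

# persist all registered data sets and their metadata (placeholder file name)
exp.save("explore.mod")

# ... in a later session, reload the saved workspace and continue exploring ...
exp = DataExplorer()
exp.restore("explore.mod")
exp.showNames()                  # list the registered data set names
print(exp.getNotes("pdemand"))   # notes added earlier are preserved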

Data Exploration API

The remaining sections list all the data exploration functions. They are split into two categories: 1) Summary Statistics and 2) Test Statistics. The API has the following characteristics.

  • Since we are exploring the data and gaining insight, the functions do not mutate the data.
  • The functions for adding data are data type aware. If you try to use an invalid data type with a function, e.g. cross correlation with categorical data, it will be detected and an AssertionError will be raised.
  • You will normally pass one or more names of data sets that are already registered. However, you may also pass any unregistered list or numpy array.
  • If the underlying library returns a p value, the output will indicate whether the null hypothesis is accepted or rejected, based on the significance level passed.
  • Some of the functions take two data sets and require that the data sets be of the same size. In such cases, the size check is done.
  • By default there is always console output. To disable console output, set verbose to False in the constructor.

The following data types are supported.

  1. Numerical (integer, float)
  2. Binary (integer with values 0 and 1)
  3. Categorical (integer, string)

Summary Statistics API

The functions listed below belong to 3 subcategories. Most functions return a result wrapped in a dictionary; some produce plots. An illustrative correlation example follows the table.

  1. Basic summary statistics
  2. Frequency related statistics
  3. Correlation
Function : Comment
queryFileData(filePath, *columns) : query column data type from a data file
queryDataFrameData(df, *columns) : query column data type from a data frame
plot(ds, yscale=None) : line plot
scatterPlot(ds1, ds2) : scatter plot
print(self, ds) : prints size of data set and first 50 elements
plotHist(ds, cumulative, density, nbins=None) : plots histogram or cumulative distribution
isMonotonicallyChanging(ds) : checks if data is monotonically increasing or decreasing
getFeqDistr(ds, nbins=10) : gets frequency distribution or histogram
getCumFreqDistr(ds, nbins=10) : gets cumulative frequency distribution
getEntropy(ds, nbins=10) : gets entropy
getRelEntropy(ds1, ds2, nbins=10) : gets relative entropy
getMutualInfo(ds1, ds2, nbins=10) : gets mutual information
getPercentile(ds, value) : gets percentile for a value
getValueAtPercentile(ds, percent) : gets value at percentile
getUniqueValueCounts(ds, maxCnt=10) : gets unique values and counts
getCatUniqueValueCounts(ds, maxCnt=10) : gets categorical data unique values and counts
getStats(ds, nextreme=5) : gets summary statistics
getDifference(self, ds, order) : gets difference of given order
getTrend(ds, doPlot=False) : gets trend
deTrend(self, ds, trend, doPlot=False) : gets trend-removed data
getTimeSeriesComponents(ds, model, freq, summaryOnly, doPlot=False) : gets trend, cycle and residue components of time series
getOutliersWithIsoForest(contamination, *dsl) : gets outliers with isolation forest
getOutliersWithLocalFactor(contamination, *dsl) : gets outliers with local outlier factor
getOutliersWithSupVecMach(nu, *dsl) : gets outliers using one class SVM
fitLinearReg(ds, doPlot=False) : gets linear regression coefficients
fitSiegelRobustLinearReg(ds, doPlot=False) : gets Siegel robust linear regression coefficients based on median
fitTheilSenRobustLinearReg(ds, doPlot=False) : gets Theil-Sen robust linear regression coefficients based on median
plotRegFit(x, y, slope, intercept) : plots regression fitted line
getCovar(*dsl) : gets covariance
getPearsonCorr(ds1, ds2, sigLev=.05) : gets Pearson correlation coefficient
getSpearmanRankCorr(ds1, ds2, sigLev=.05) : gets Spearman rank correlation coefficient
getKendalRankCorr(ds1, ds2, sigLev=.05) : gets Kendall’s tau correlation for ordinal data
getPointBiserialCorr(ds1, ds2, sigLev=.05) : gets point biserial correlation between binary and numeric data
getConTab(ds1, ds2) : gets contingency table for categorical data pair
getChiSqCorr(ds1, ds2, sigLev=.05) : gets chi square correlation for categorical data
getAnovaCorr(ds1, ds2, grByCol, sigLev=.05) : gets ANOVA correlation for numerical and categorical data
plotAutoCorr(ds, lags, alpha, diffOrder=0) : plots auto correlation
getAutoCorr(ds, lags, alpha=.05) : gets auto correlation
plotParAcf(ds, lags, alpha) : plots partial auto correlation
getParAutoCorr(ds, lags, alpha=.05) : gets partial auto correlation
plotCrossCorr(ds1, ds2, normed, lags) : plots cross correlation
getCrossCorr(ds1, ds2) : gets cross correlation
getFourierTransform(ds) : gets fast Fourier transform
getNullCount(ds) : gets count of null (None, nan) values
getValueRangePercentile(ds, value1, value2) : gets percentile difference for value range
getLessThanValues(ds, cvalue) : gets values less than given value
getGreaterThanValues(ds, cvalue) : gets values greater than given value
getGausianMixture(ncomp, cvType, ninit, *dsl) : gets parameters of Gaussian mixture components
getKmeansCluster(nclust, ninit, *dsl) : gets cluster parameters with k-means clustering
getOutliersWithCovarDeterminant(contamination, *dsl) : gets outliers using covariance determinant
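As an illustration of the correlation functions, the sketch below registers two numeric columns and computes their Pearson correlation. The file name corr.txt and the column names are hypothetical; per the API characteristics above, the result also indicates whether the null hypothesis is accepted or rejected at the given significance level.

from daexp import DataExplorer

exp = DataExplorer()
# register two numeric columns from a hypothetical file
exp.addFileNumericData("corr.txt", 0, 1, "price", "demand")

# Pearson correlation between the two registered data sets; since the
# underlying library returns a p value, the result dictionary also states
# whether the null hypothesis (no correlation) is accepted or rejected
res = exp.getPearsonCorr("price", "demand", sigLev=.05)
print(res)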

The function getStats() packs a lot of statistics about the data into its return value, as listed below.

  1. Data size
  2. Min value
  3. Max value
  4. Smallest n values
  5. Largest n values
  6. Mean
  7. Median
  8. Mode
  9. Mode count
  10. Std deviation
  11. Skew
  12. Kurtosis
  13. Median absolute deviation

Test Statistics API

These functions perform tests for various statistical properties, as listed below. A short usage sketch follows the table.

  1. Distribution fit tests
  2. Stationarity tests
  3. Two sample statistic tests
Function : Comment
testStationaryAdf(ds, regression, autolag, sigLev=.05) : ADF stationarity test
testStationaryKpss(ds, regression, nlags, sigLev=.05) : KPSS stationarity test
testNormalJarqBera(ds, sigLev=.05) : Jarque-Bera normality test
testNormalShapWilk(ds, sigLev=.05) : Shapiro-Wilk normality test
testNormalDagast(ds, sigLev=.05) : D’Agostino’s K squared normality test
testDistrAnderson(ds, dist, sigLev=.05) : Anderson test for normal, expon, logistic, gumbel, gumbel_l, gumbel_r
testSkew(ds, sigLev=.05) : tests skew against normal distribution
testTwoSampleStudent(ds1, ds2, sigLev=.05) : Student’s t 2 sample test
testTwoSampleKs(ds1, ds2, sigLev=.05) : Kolmogorov-Smirnov 2 sample statistic test
testTwoSampleMw(ds1, ds2, sigLev=.05) : Mann-Whitney 2 sample statistic test
testTwoSampleWilcox(ds1, ds2, sigLev=.05) : Wilcoxon signed-rank 2 sample statistic test
testTwoSampleKw(ds1, ds2, sigLev=.05) : Kruskal-Wallis 2 sample statistic test
testTwoSampleFriedman(ds1, ds2, ds3, sigLev=.05) : Friedman statistic test
testTwoSampleEs(ds1, ds2, sigLev=.05) : Epps-Singleton 2 sample statistic test
testTwoSampleAnderson(ds1, ds2, sigLev=.05) : Anderson 2 sample statistic test
testTwoSampleScaleAb(ds1, ds2, sigLev=.05) : Ansari-Bradley 2 sample scale statistic test
testTwoSampleScaleMood(ds1, ds2, sigLev=.05) : Mood 2 sample scale statistic test
testTwoSampleVarBartlet(ds1, ds2, sigLev=.05) : Bartlett 2 sample variance statistic test
testTwoSampleVarLevene(ds1, ds2, sigLev=.05) : Levene 2 sample variance statistic test
testTwoSampleVarFk(ds1, ds2, sigLev=.05) : Fligner-Killeen 2 sample variance statistic test
testTwoSampleMedMood(ds1, ds2, sigLev=.05) : Mood 2 sample median statistic test
testTwoSampleZc(ds1, ds2, sigLev=.05) : Zhang-C 2 sample statistic test
testTwoSampleZa(ds1, ds2, sigLev=.05) : Zhang-A 2 sample statistic test
testTwoSampleZk(ds1, ds2, sigLev=.05) : Zhang-K 2 sample statistic test
testTwoSampleCvm(ds1, ds2, sigLev=.05) : Cramer-von Mises (CVM) 2 sample statistic test
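The sketch below shows the two sample tests in use: two synthetic samples are registered from lists, one is checked for normality and then both are compared with the Kolmogorov-Smirnov test. The sample values and data set names are made up for illustration.

import random
from daexp import DataExplorer

exp = DataExplorer()

# register two synthetic samples as named data sets
before = [random.gauss(50.0, 5.0) for _ in range(200)]
after = [random.gauss(52.0, 5.0) for _ in range(200)]
exp.addListNumericData(before, "before")
exp.addListNumericData(after, "after")

# Shapiro-Wilk normality test on one sample
exp.testNormalShapWilk("before", sigLev=.05)

# Kolmogorov-Smirnov 2 sample test; the result dictionary indicates whether
# the null hypothesis (same distribution) is accepted or rejected at the
# 5% significance level
res = exp.testTwoSampleKs("before", "after", sigLev=.05)
print(res)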

Usage Examples

In this section, we will go through examples of API usage. For each, I will provide the example code and the result. Please refer to the tutorial for more examples.

The first one is summary statistics. It adds 2 data sets corresponding to 2 columns of a file containing supply chain demand data and then calls getStats().

import os
import sys

sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("bord.txt", 0, 1, "pdemand", "demand")
exp.getStats("pdemand")

output:
== adding numeric columns from a file ==
done

== getting summary statistics for data sets pdemand ==
{   'kurtosis': -0.12152386739702337,
    'length': 1000,
    'mad': 2575.2762,
    'max': 18912,
    'mean': 10920.908,
    'median': 11011.5,
    'min': 3521,
    'mode': 10350,
    'mode count': 3,
    'n largest': [18912, 18894, 17977, 17811, 17805],
    'n smallest': [3521, 3802, 4185, 4473, 4536],
    'skew': -0.009681701835865877,
    'std': 2569.1597609989144}

In the next example, we will analyze retail daily sales data. The data has weekly seasonality, so in the auto correlation we expect to find a large peak at lag 7. Let’s find out.

import os
import sys

sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("sale.txt", 0, "sale")
exp.getAutoCorr("sale", 20)

output:
== adding numeric columns from a file ==
done

== getting auto correlation for data sets sale ==
result details:
{   'autoCorr': array([ 1.        ,  0.5738174 , -0.20129608, -0.82667856, -0.82392299,
       -0.20331679,  0.56991343,  0.91427488,  0.5679168 , -0.20108015,
       -0.81710428, -0.8175842 , -0.20391004,  0.56864915,  0.90936982,
        0.56528676, -0.20657182, -0.81111562, -0.81204275, -0.1970099 ,
        0.56175539]),
    'confIntv': array([[ 1.        ,  1.        ],
       [ 0.5118379 ,  0.6357969 ],
       [-0.28111578, -0.12147637],
       [-0.90842511, -0.74493201],
       [-0.93316119, -0.71468479],
       [-0.33426918, -0.07236441],
       [ 0.43775398,  0.70207288],
       [ 0.77298956,  1.0555602 ],
       [ 0.40548625,  0.73034734],
       [-0.37096731, -0.03119298],
       [-0.98790327, -0.64630529],
       [-1.00279183, -0.63237657],
       [-0.40249873, -0.00532136],
       [ 0.36925779,  0.76804052],
       [ 0.70384298,  1.11489665],
       [ 0.34484471,  0.7857288 ],
       [-0.43251377,  0.01937013],
       [-1.03778192, -0.58444933],
       [-1.04959751, -0.57448798],
       [-0.44499878,  0.05097898],
       [ 0.313166  ,  0.81034477]])}

As expected, the largest peak is at lag 0. The next largest peak is at lag 7, with a value of 0.91427488.

Finally, with knowledge of the seasonal period, we can extract the time series components as below.

# code same as in the last example
exp.getTimeSeriesComponents("sale","additive", 7, True, False)

output:
== adding numeric columns from a file ==
done

== extracting trend, cycle and residue components of time series for data sets sale ==
result details:
{   'residueMean': 0.022420235699977295,
    'residueStdDev': 19.14825253159541,
    'seasonalAmp': 98.22786720321932,
    'trendMean': 1004.9323081345215,
    'trendSlope': -0.0048913825348870996}

The average value is the trend mean. The trend has a small negative slope. The seasonal component has an amplitude of about 98.23. The residue mean and standard deviation are also reported. Because we set the 4th argument (summaryOnly) to True, we got a summary of the time series components; if it were False, the function would have returned the actual values of the 3 components.

Wrapping Up

We have gone through a Python data exploration API with close to 70 functions. It should be easy to build a web application based on this API. Please refer to the tutorial document for more examples of how to use the API. I hope you find it useful. If you have any suggestions for new functions in the API, please let me know.

