source link: https://pkghosh.wordpress.com/2020/07/13/learn-about-your-data-with-about-seventy-data-exploration-functions-all-in-one-python-class/
Learn about Your Data with about Seventy Data Exploration Functions All in One Python Class
It’s a costly mistake to jump straight into building machine learning models before gaining good insight into your data. I have made that mistake and paid the price. Since then, I made a resolution to learn as much as possible about the data before taking the next step. While exploring data, I always found myself using multiple Python libraries and doing a plethora of imports from various Python modules.
That experience motivated me to consolidate all the common Python data exploration functions into one Python class, to make them easier to use. As an added feature, I have also provided a workspace-like interface with which you can register multiple data sets, each under a user-provided name. You can then refer to the data sets by name and perform various operations. The Python implementation is available in my open source project avenir on GitHub.
API Usage
Most of the data exploration functions are implemented on top of existing Python libraries; very few are implemented from scratch. The following Python libraries are used.
- numpy
- scipy
- pandas
- statsmodels
- scikit-learn
There are two kinds of API: most functions are for data exploration, and the rest are for workspace management. API usage for data exploration has the following pattern.
- Create an instance of DataExplorer
- Register multiple data sets. A data set is essentially a 1D array with a name. The data source can be a file, a pandas data frame, a numpy array, or a list. For example, you can pass a CSV file, specifying the columns you want to use; a data set will be registered for each column.
- Call any of the 66 available data exploration functions. The names of one or more data sets are always passed as arguments.
- The result is always returned as a Python dictionary.
- By default, there is always console output. It can be disabled by setting the argument verbose to False in the constructor.
- You can add notes for any registered data set as you are exploring it.
- The whole workspace can be saved and restored if you want to continue your exploration session later. The workspace consists of a dictionary holding all the data sets and a dictionary holding the metadata for all the data sets.
The source code has comments on the input arguments for each function. For further details, it’s best to refer to the documentation of the base python library used for any particular function.
Workspace Management API
Here are the functions for workspace management. Through these, you can load data from various sources, save and restore workspace. You need to use these to register various data sets before you can operate on them, although data exploration API allows you to pass any unregistered list or numpy array also.
| Function | Comment |
| --- | --- |
| save(filePath) | save workspace |
| restore(filePath) | restore workspace |
| queryFileData(filePath, *columns) | query data types for file columns |
| queryDataFrameData(df, *columns) | query data types for data frame columns |
| getDataType(col) | query data type for a data set (numeric, binary, categorical) |
| addFileNumericData(filePath, *columns) | add numeric columns from file |
| addFileBinaryData(filePath, *columns) | add binary columns from file |
| addDataFrameNumericData(df, *columns) | add numeric columns from data frame |
| addDataFrameBinaryData(df, *columns) | add binary columns from data frame |
| addListNumericData(ds, name) | add numeric data from list |
| addListBinaryData(ds, name) | add binary data from list |
| addFileCatData(filePath, *columns) | add categorical columns from file |
| addDataFrameCatData(df, *columns) | add categorical columns from data frame |
| addCatListData(ds, name) | add categorical data from list |
| remData(ds) | remove data set |
| addNote(ds, note) | add note for a data set |
| getNotes(ds) | get notes for a data set |
| getNumericData(ds) | get numeric data for a data set |
| getCatData(ds) | get categorical data for a data set |
| showNames() | get list of names for registered data sets |
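Under the hood, save() and restore() only need to persist the two dictionaries mentioned earlier: the data sets and their metadata/notes. Below is a toy sketch of such a workspace, purely for illustration; the class and method names here are made up and are not the avenir API.

```python
import pickle

class MiniWorkspace:
    """Toy model of a data exploration workspace: named 1D data sets plus notes."""

    def __init__(self):
        self.datasets = {}   # name -> list of values
        self.notes = {}      # name -> list of note strings

    def add_list_numeric_data(self, ds, name):
        # register a list under a user-provided name
        self.datasets[name] = list(ds)

    def add_note(self, name, note):
        self.notes.setdefault(name, []).append(note)

    def save(self, file_path):
        # persist both dictionaries so the session can be resumed later
        with open(file_path, "wb") as f:
            pickle.dump((self.datasets, self.notes), f)

    def restore(self, file_path):
        with open(file_path, "rb") as f:
            self.datasets, self.notes = pickle.load(f)
```

After a restore, a fresh instance sees exactly the data sets and notes that were registered before the save.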
Data Exploration API

The rest of the sections list all the data exploration functions, split into two parts: 1) Summary Statistics and 2) Test Statistics. The API has the following characteristics.
- Since we are exploring data and extracting insight, the functions do not mutate the data.
- The functions are data type aware. If you try to use an invalid data type with a function, e.g. cross correlation with categorical data, it will be detected and an AssertionError will be raised.
- You will normally pass the names of data sets already registered. However, you may also pass any unregistered list or numpy array.
- If the underlying library returns a p-value, the output will indicate whether the null hypothesis is accepted or rejected, based on the critical value passed.
- Some of the functions take two data sets and require that the data sets be of the same size. In such cases, a size check is done.
- By default, there is always console output. To disable it, set verbose to False in the constructor.
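The accept/reject reporting based on a p-value amounts to a simple decision rule. Here is a minimal sketch of such a helper; the function name and default threshold are illustrative, not the library's actual API.

```python
def decide_null_hypothesis(pvalue, crit_value=0.05):
    """Report whether the null hypothesis is accepted or rejected.

    If the p-value is below the critical value, the null hypothesis is
    rejected at that significance level; otherwise it is accepted.
    """
    return "rejected" if pvalue < crit_value else "accepted"
```

For example, a p-value of 0.001 at the default 0.05 level yields "rejected", while 0.3 yields "accepted".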
The following data types are supported.
- Numerical (integer, float)
- Binary (integer with values 0 and 1)
- Categorical (integer, string)
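For illustration, the three type distinctions above can be sniffed with a few lines of Python. This sketch is not the class's actual getDataType() logic, just one plausible way to classify a 1D data set.

```python
def get_data_type(ds):
    """Classify a 1D data set as 'binary', 'numeric' or 'categorical'."""
    values = set(ds)
    if values <= {0, 1}:
        # only 0/1 values present
        return "binary"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in ds):
        return "numeric"
    # anything else (e.g. strings, mixed types) is treated as categorical
    return "categorical"
```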
Summary Statistics API
The functions listed below belong to 3 sub-categories. Most functions return a result wrapped in a dictionary; some do plotting.
- Basic summary statistics
- Frequency related statistics
- Correlation
The function getStats() packs a lot of statistics about the data into its return value, as listed below.
- Data size
- Min value
- Max value
- Smallest n values
- Largest n values
- Mean
- Median
- Mode
- Mode count
- Std deviation
- Skew
- Kurtosis
- Median absolute deviation
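Most of these quantities map directly onto numpy operations. Below is a numpy-only sketch computing a comparable dictionary; the field names mirror the output shown in the examples, but this is not the class's actual implementation.

```python
import numpy as np
from collections import Counter

def get_stats(ds, n=5):
    """Summary statistics for a 1D numeric data set."""
    a = np.asarray(ds, dtype=float)
    srt = np.sort(a)
    # most frequent value and its count
    mode_val, mode_count = Counter(a.tolist()).most_common(1)[0]
    m, s = a.mean(), a.std()
    centered = a - m
    return {
        "length": int(a.size),
        "min": float(srt[0]),
        "max": float(srt[-1]),
        "n smallest": srt[:n].tolist(),
        "n largest": srt[-n:][::-1].tolist(),
        "mean": float(m),
        "median": float(np.median(a)),
        "mode": mode_val,
        "mode count": mode_count,
        "std": float(s),
        "skew": float((centered ** 3).mean() / s ** 3),          # Fisher-Pearson skewness
        "kurtosis": float((centered ** 4).mean() / s ** 4 - 3),  # excess kurtosis
        "mad": float(np.median(np.abs(a - np.median(a)))),       # median absolute deviation
    }
```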
Test Statistics API
These functions perform tests for various statistical properties as below.
- Fitness test for various distributions
- Stationary test
- Two sample statistic test
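As an example of a two sample statistic test, scipy's two-sample Kolmogorov-Smirnov test can be wrapped with the accept/reject reporting described earlier. The wrapper name, output keys, and sample data below are illustrative.

```python
import numpy as np
from scipy import stats

def two_sample_ks(ds1, ds2, crit_value=0.05):
    """Two-sample KS test: are the two samples from the same distribution?"""
    stat, pvalue = stats.ks_2samp(ds1, ds2)
    decision = "rejected" if pvalue < crit_value else "accepted"
    return {"stat": stat, "pvalue": pvalue, "null hypothesis": decision}

# illustrative samples: one standard normal, one strongly shifted
rng = np.random.default_rng(42)
same = rng.normal(0.0, 1.0, 200)
shifted = rng.normal(3.0, 1.0, 200)
```

Comparing a sample with itself yields a statistic of 0 and the null hypothesis is accepted; a strongly shifted sample gets rejected.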
Usage Examples
In this section, we will go through examples of API usage. For each, I will provide the example code and the result. Please refer to the tutorial for more examples.
The first one is summary statistics. It adds 2 data sets corresponding to 2 columns in a file containing supply chain demand data, and then calls getStats().
```python
import os
import sys
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("bord.txt", 0, 1, "pdemand", "demand")
exp.getStats("pdemand")
```

output:

```
== adding numeric columns from a file ==
done
== getting summary statistics for data sets pdemand ==
{   'kurtosis': -0.12152386739702337,
    'length': 1000,
    'mad': 2575.2762,
    'max': 18912,
    'mean': 10920.908,
    'median': 11011.5,
    'min': 3521,
    'mode': 10350,
    'mode count': 3,
    'n largest': [18912, 18894, 17977, 17811, 17805],
    'n smallest': [3521, 3802, 4185, 4473, 4536],
    'skew': -0.009681701835865877,
    'std': 2569.1597609989144}
```
In the next example, we will analyze retail daily sales data. The data has weekly seasonality, so in the auto correlation we expect to find a large peak at lag 7. Let’s find out.
```python
import os
import sys
sys.path.append(os.path.abspath("../mlextra"))
from daexp import *

exp = DataExplorer()
exp.addFileNumericData("sale.txt", 0, "sale")
exp.getAutoCorr("sale", 20)
```

output:

```
== adding numeric columns from a file ==
done
== getting auto correlation for data sets sale ==
result details:
{   'autoCorr': array([ 1.        ,  0.5738174 , -0.20129608, -0.82667856, -0.82392299,
       -0.20331679,  0.56991343,  0.91427488,  0.5679168 , -0.20108015,
       -0.81710428, -0.8175842 , -0.20391004,  0.56864915,  0.90936982,
        0.56528676, -0.20657182, -0.81111562, -0.81204275, -0.1970099 ,
        0.56175539]),
    'confIntv': array([[ 1.        ,  1.        ],
       [ 0.5118379 ,  0.6357969 ],
       [-0.28111578, -0.12147637],
       [-0.90842511, -0.74493201],
       [-0.93316119, -0.71468479],
       [-0.33426918, -0.07236441],
       [ 0.43775398,  0.70207288],
       [ 0.77298956,  1.0555602 ],
       [ 0.40548625,  0.73034734],
       [-0.37096731, -0.03119298],
       [-0.98790327, -0.64630529],
       [-1.00279183, -0.63237657],
       [-0.40249873, -0.00532136],
       [ 0.36925779,  0.76804052],
       [ 0.70384298,  1.11489665],
       [ 0.34484471,  0.7857288 ],
       [-0.43251377,  0.01937013],
       [-1.03778192, -0.58444933],
       [-1.04959751, -0.57448798],
       [-0.44499878,  0.05097898],
       [ 0.313166  ,  0.81034477]])}
```
As expected, the largest peak is at lag 0. The next largest peak is at lag 7, with a value of 0.91427488.
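The lag-7 peak is easy to reproduce on synthetic data with weekly seasonality, using plain numpy. The series below is made up for illustration.

```python
import numpy as np

# synthetic daily series with period-7 seasonality plus a little noise
rng = np.random.default_rng(0)
n = 365
t = np.arange(n)
sale = 1000 + 100 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, n)

def auto_corr(x, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

lag7 = auto_corr(sale, 7)  # near 1: the series repeats every 7 days
lag3 = auto_corr(sale, 3)  # off-period lag: much weaker (in fact negative here)
```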
Finally, with the knowledge of the seasonal period, we can extract the time series components as below.
```python
# code same as in the last example
exp.getTimeSeriesComponents("sale", "additive", 7, True, False)
```

output:

```
== adding numeric columns from a file ==
done
== extracting trend, cycle and residue components of time series for data sets sale ==
result details:
{   'residueMean': 0.022420235699977295,
    'residueStdDev': 19.14825253159541,
    'seasonalAmp': 98.22786720321932,
    'trendMean': 1004.9323081345215,
    'trendSlope': -0.0048913825348870996}
```
The average value is in the trend mean, and the trend has a small negative slope. The seasonality has an amplitude of 98.227. The residue mean and standard deviation are also reported. Because we set the 4th argument to True, we got a summary of the time series components; if it were False, the function would have returned the actual values of the 3 components.
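An additive decomposition of this kind can be sketched with numpy alone: a moving-average trend, per-position seasonal means, and a residue. This is an illustration of the idea, not the class's actual implementation; the function name, output keys, and window choices are assumptions.

```python
import numpy as np

def additive_decompose(x, period):
    """Split a series into trend (centered moving average), seasonal and residue parts."""
    x = np.asarray(x, dtype=float)
    # centered moving average as the trend estimate (edges are only approximate)
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    detrended = x - trend
    # seasonal component: mean of each position within the period
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal_full = np.tile(seasonal, len(x) // period + 1)[: len(x)]
    residue = detrended - seasonal_full
    return {
        "trendMean": float(trend.mean()),
        "seasonalAmp": float(seasonal.max() - seasonal.min()) / 2,  # half peak-to-peak
        "residueMean": float(residue.mean()),
        "residueStdDev": float(residue.std()),
    }
```

On a clean synthetic series with amplitude-50 weekly seasonality around a level of 1000, the sketch recovers a trend mean near 1000 and a seasonal amplitude near 50, up to edge effects of the moving average.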
Wrapping Up
We have gone through a Python data exploration API with close to 70 functions. It should be easy to build a web application based on this API. Please refer to the tutorial document for more examples of how to use the API. I hope you find it useful. If you have any suggestions for new functions in the API, please let me know.