
What is a data set for machine learning?

Source: https://www.neuraldesigner.com/blog/dataset-datamatrix

1. Definition

A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting model.

A data source is the location where the data being used originates. Data sources come in different forms, such as Excel files, .csv files, databases, image data, etc.

Before building a model, it is necessary to transform the data into numbers, i.e., we have to collect the data in a matrix of real numbers, creating the data matrix.

Every column represents a particular variable, and each row corresponds to a given sample of the data set in question.

A variable is any characteristic, number, or quantity that can be measured or counted. It is an attribute that describes a person, place, thing, or idea.

The variable’s value can "vary" from one entity to another. Depending on their nature, we can distinguish different types of variables: numeric, ordinal, binary, or categorical.

Variables can be used as inputs or targets. Input variables are the independent variables in the model (they are also called features or attributes), and target variables are the dependent variables in the model.

A sample is an observation of all variables. The samples will also have different uses. We divide the samples into three different subsets. These are the training set (used to build different candidate models), the selection set (used to select the model that exhibits the best properties), and the test set (used to validate the final model).
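
For illustration, a minimal Python sketch of such a split, assuming NumPy, invented data, and 60/20/20 proportions (none of which are prescribed by the article):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
d = rng.random((100, 3))  # hypothetical data matrix: 100 samples, 3 variables

indices = rng.permutation(d.shape[0])  # shuffle the sample indices
n_train = int(0.6 * d.shape[0])        # 60% of samples for training
n_selection = int(0.2 * d.shape[0])    # 20% for selection

training = d[indices[:n_train]]
selection = d[indices[n_train:n_train + n_selection]]
test = d[indices[n_train + n_selection:]]  # remaining 20% for testing
```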

Sometimes, the data set may be incomplete or have missing values. This is one of the main problems when applying neural networks to real-world problems. To solve it, we can discard the whole sample or impute the missing value.
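
As a rough sketch of both strategies, assuming NumPy and a small matrix in which NaN marks the missing entries (the data are invented for the example):

```python
import numpy as np

d = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])  # NaN marks a missing value

# Option 1: discard every sample (row) that contains a missing value.
complete = d[~np.isnan(d).any(axis=1)]

# Option 2: impute each missing value with the mean of its variable (column).
column_means = np.nanmean(d, axis=0)
imputed = np.where(np.isnan(d), column_means, d)
```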

2. Data analysis

Before building a model, we need to analyze the data statistically to understand what it represents.

The most basic analysis is the statistics for each variable, and the most important statistical parameters are the minimum, maximum, mean, and standard deviation.
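
A minimal NumPy sketch of these per-variable statistics, using a few values borrowed from the wind turbine example later in the article:

```python
import numpy as np

d = np.array([[380.048, 5.311],
              [453.769, 5.672],
              [2820.466, 9.973]])

# axis=0 computes over samples (rows), giving one value per variable (column).
print("minimum:", d.min(axis=0))
print("maximum:", d.max(axis=0))
print("mean:   ", d.mean(axis=0))
print("std:    ", d.std(axis=0))
```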

Another descriptive analysis is that of the distribution of each variable. For the predictive model to be of high quality, we must check that the variables in the data set follow a uniform or a normal distribution.

We can summarize these distributions with histograms, pie charts, medians, quartiles, or box plots.
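
For instance, a small NumPy sketch computing the histogram and the quartiles of a single, randomly generated variable:

```python
import numpy as np

v = np.random.default_rng(seed=0).normal(size=1000)  # invented variable

counts, bin_edges = np.histogram(v, bins=10)     # histogram of the variable
q1, median, q3 = np.percentile(v, [25, 50, 75])  # quartiles and median

print("histogram counts:", counts)
print("Q1, median, Q3:", q1, median, q3)
```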

We can also discover dependencies between the variables of the data set from the correlations.

A correlation is a numerical value between -1 and 1 that expresses the strength of the relationship between two variables. If the correlation is close to 1 between two variables, they are positively related; if it is close to 0, the study variables are unrelated; and if the correlation is close to -1, the variables are negatively related.
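
As a quick illustration, assuming NumPy and two invented variables:

```python
import numpy as np

x = np.array([5.3, 5.7, 6.1, 7.4, 9.9])              # e.g., wind speed
y = np.array([380.0, 454.0, 610.0, 1500.0, 2820.0])  # e.g., generated power

correlation = np.corrcoef(x, y)[0, 1]  # value between -1 and 1
print(correlation)  # close to 1: the variables are positively related
```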

We can also analyze the data to detect potential problems. One of the most common problems is outliers.

Outliers are observations that deviate abnormally from the rest of the data and can spoil and confound the training process.

We can deal with outliers using Tukey’s test, a univariate method, or the local outlier factor, a multivariate method.
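
A minimal sketch of Tukey’s fences with NumPy, using the conventional 1.5 × IQR factor and invented values (the local outlier factor needs a multivariate treatment and is omitted here):

```python
import numpy as np

v = np.array([5.1, 5.3, 5.6, 5.9, 6.2, 6.4, 25.0])  # 25.0 is an outlier

q1, q3 = np.percentile(v, [25, 75])
iqr = q3 - q1  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = v[(v < lower) | (v > upper)]
print(outliers)  # [25.]
```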

We can also filter the data to create models with subsets of it. Filtering is usually temporary: we keep the entire data set, but only a part of it is used for the calculation.

Filtering requires that you specify a rule or logic to identify the cases you want to include in the analysis.
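
For example, with NumPy, such a rule can be expressed as a boolean mask; the rule (wind speed above 6 m/s) and the data below are assumptions for the example:

```python
import numpy as np

d = np.array([[5.3, 380.0],
              [5.7, 454.0],
              [9.9, 2820.0]])  # columns: wind speed, power

mask = d[:, 0] > 6.0  # rule: keep samples with wind speed above 6 m/s
filtered = d[mask]    # subset used for the analysis; d itself is unchanged
```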

On the other hand, it is always convenient to scale the variables so that they all take values of similar magnitude before training a neural network.

The objective of data scaling is to convert the data into an appropriate range for its computation. Data scaling is generally performed variable-by-variable, as different variables may require different types of scaling.

Some of the most widely used scaling methods are the minimum-maximum, the mean-standard deviation, and the logarithm.
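
A minimal NumPy sketch of the three methods, applied variable-by-variable to a few invented values (the logarithm assumes strictly positive data):

```python
import numpy as np

d = np.array([[380.048, 5.311],
              [453.769, 5.672],
              [2820.466, 9.973]])

# Minimum-maximum scaling: maps each variable to [0, 1].
min_max = (d - d.min(axis=0)) / (d.max(axis=0) - d.min(axis=0))

# Mean-standard deviation scaling: zero mean, unit variance per variable.
standardized = (d - d.mean(axis=0)) / d.std(axis=0)

# Logarithmic scaling: compresses variables spanning orders of magnitude.
logarithmic = np.log(d)
```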

Training algorithms for neural networks do not work with the data matrix directly. Instead, they use data structures called data batches.

A batch of data contains two tensors, one with input data and one with target data, and the rank of these tensors depends on the model type.
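
As a rough sketch, assuming NumPy, an invented batch size of 32, and a matrix whose first column is the input and second column the target:

```python
import numpy as np

d = np.random.rand(100, 2)  # hypothetical data matrix
batch_size = 32             # assumed batch size

for start in range(0, d.shape[0], batch_size):
    batch = d[start:start + batch_size]
    inputs = batch[:, :1]   # input tensor of the batch
    targets = batch[:, 1:]  # target tensor of the batch
    # ...feed (inputs, targets) to the training algorithm...
```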

3. Data matrix

Before building a model, we need to collect the data in a matrix of real numbers.

Let $p$ denote the number of rows and $q$ the number of columns. The data matrix is then $d \in \mathbb{R}^{p \times q}$.

As we can see, machine learning models require all data to be real numbers.

The data matrix has the following form,

\begin{eqnarray} d = \left( \begin{array}{ccc} d_{1,1} & \cdots & d_{1,q}\\ \vdots & \ddots & \vdots \\ d_{p,1} & \cdots & d_{p,q}\\ \end{array} \right). \end{eqnarray}

A sample is a vector $u \in \mathbb{R}^{q}$, where $q$ is the number of columns in the data matrix. In this regard, the data matrix contains $p$ samples,

\begin{eqnarray} u_{i}:=row_{i}(d), \quad i=1,\ldots,p. \end{eqnarray}

A variable is a vector $v \in \mathbb{R}^{p}$, where $p$ is the number of rows in the data matrix. In this regard, the data matrix contains $q$ variables,

\begin{eqnarray} v_{i}:=col_{i}(d), \quad i=1,\ldots,q. \end{eqnarray}
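
In NumPy terms, this notation reads as follows (the values are invented):

```python
import numpy as np

d = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # p = 2 samples, q = 3 variables

sample_1 = d[0, :]    # a row: one sample, a vector in R^q
variable_1 = d[:, 0]  # a column: one variable, a vector in R^p
```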


Our source of information might not be directly in the format of a matrix.

For example, the information may be distributed in several tables of a database.

We can also find sets of images used, for example, to diagnose a tumor.

In addition, some data may not be real numbers.

For example, a customer's country (Spain, France, etc.) is categorical.

This means that an essential part of building machine learning models is the creation of a data matrix with the correct format.
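
For example, a categorical variable such as the country can be converted into real numbers with one-hot encoding; a minimal sketch, with an invented customer list:

```python
import numpy as np

countries = ["Spain", "France", "Spain"]  # hypothetical categorical variable
categories = sorted(set(countries))       # ['France', 'Spain']

# One column of real numbers per category; 1.0 marks the sample's category.
one_hot = np.zeros((len(countries), len(categories)))
for i, country in enumerate(countries):
    one_hot[i, categories.index(country)] = 1.0

print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
```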

The following example from the industry sector shows the data matrix of a real model: a wind turbine.

A wind turbine manufacturer wants to know the electrical power generated by the device at different wind speeds. To do this, they measure different operating scenarios and generate the following data matrix,

\begin{eqnarray}\nonumber d = \left( \begin{array}{cc} 380.048 & 5.311\\ 453.769 & 5.672\\ \vdots & \vdots \\ 2820.466 & 9.973\\ \end{array} \right). \end{eqnarray}

The number of columns in the data matrix is $q=2$. Each column corresponds to a variable: the power generated by the turbine (in kilowatts) and the wind speed (in meters per second). The wind speed is the input and the generated power is the target.

The number of rows in the data matrix is $p=48007$. Each row corresponds to a sample. Each sample contains values of the two variables.
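
As a small sketch, the three samples shown above can be separated into input and target with NumPy:

```python
import numpy as np

d = np.array([[380.048, 5.311],
              [453.769, 5.672],
              [2820.466, 9.973]])  # first rows of the wind turbine matrix

wind_speed = d[:, 1]  # input variable (meters per second)
power = d[:, 0]       # target variable (kilowatts)
```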

4. Conclusions

Datasets collect the data needed to create and train a model. In general, the data must be transformed to adapt it to machine learning and create the data matrix. Subsequently, it is advisable to perform a statistical study of the data to deal with potential problems such as outliers.

