Typed Arrays from String Arrays for Dataset Operation

 3 years ago
source link: https://datacrayon.com/posts/programming/rust-notebooks/typed-arrays-from-string-arrays-for-dataset-operation/

Preamble

:dep darn = {version = "0.3.0"}
:dep ndarray = {version = "0.13.1"}
:dep ndarray-csv = {version = "0.4.1"}
:dep ureq = {version = "0.11.4"}
:dep plotly = {version = "0.4.0"}
extern crate csv;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;
use std::str::FromStr;
use plotly::{Plot, Scatter, Layout};
use plotly::common::{Mode, Title};
use plotly::layout::{Axis};

Introduction

In this section, we're going to move from a raw dataset stored in a single string array (ndarray::Array2<String>) to multiple arrays of the desired type. This will enable us to use the appropriate operations for our different types of data. We will demonstrate our approach using the Iris Flower dataset.

Loading our Dataset

Before we split our Iris Flower dataset into different typed arrays, we need to load it into our raw string array.

let file_name = "Iris.csv";

let res = ureq::get("https://datacrayon.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes())?;
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String> = rdr.deserialize_array2_dynamic().unwrap();
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
        headers.push(String::from(element));
};

Let's display some rows from the string array to see if it's loaded as expected.

darn::show_frame(&data, Some(&headers));
Id     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
"1"    "5.1"          "3.5"         "1.4"          "0.2"         "Iris-setosa"
"2"    "4.9"          "3.0"         "1.4"          "0.2"         "Iris-setosa"
"3"    "4.7"          "3.2"         "1.3"          "0.2"         "Iris-setosa"
"4"    "4.6"          "3.1"         "1.5"          "0.2"         "Iris-setosa"
"5"    "5.0"          "3.6"         "1.4"          "0.2"         "Iris-setosa"
...    ...            ...           ...            ...           ...
"146"  "6.7"          "3.0"         "5.2"          "2.3"         "Iris-virginica"
"147"  "6.3"          "2.5"         "5.0"          "1.9"         "Iris-virginica"
"148"  "6.5"          "3.0"         "5.2"          "2.0"         "Iris-virginica"
"149"  "6.2"          "3.4"         "5.4"          "2.3"         "Iris-virginica"
"150"  "5.9"          "3.0"         "5.1"          "1.8"         "Iris-virginica"

Without digging deeper, it looks like we have the correct number of columns and rows.

Raw Dataset Dimensions

Once the data is loaded we may want to determine the number of samples (rows) and features (columns) in our dataset. We can get this information using the shape() function.

&data.shape()
[150, 6]

We can see that it's returned an array, where the first element indicates the number of rows and the second element indicates the number of columns. If you have prior knowledge of the dataset then this may be a good indicator as to whether your dataset has loaded correctly. This information will be useful later when we're initialising a new array.

Deciding on Data Types

Our dataset is currently loaded into a homogeneous array of strings, ndarray::Array2<String>. This data type has allowed us to load all our data in from a CSV file without prior knowledge of the suitable data types. However, it also means that we cannot apply the operations that suit each feature. For example, we can't easily determine any central tendencies for the SepalLengthCm column, or convert our units from centimetres to something else. All we can do right now is operate on these values as strings.

Let's see what happens if we try to operate on the data in its current form, e.g. if we want to find the mean average value of each column.

data.mean_axis(Axis(0)).unwrap()
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ the trait `num_traits::identities::Zero` is not implemented for `std::string::String`
the trait bound `std::string::String: num_traits::identities::Zero` is not satisfied
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ the trait `num_traits::cast::FromPrimitive` is not implemented for `std::string::String`
the trait bound `std::string::String: num_traits::cast::FromPrimitive` is not satisfied
data.mean_axis(Axis(0)).unwrap()
     ^^^^^^^^^ no implementation for `std::string::String / std::string::String`
cannot divide `std::string::String` by `std::string::String`

As expected, Rust is complaining about our data types. Let's look again at a summary of the dataset and make some decisions about our desired data types.

darn::show_frame(&data, Some(&headers));
Id     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
"1"    "5.1"          "3.5"         "1.4"          "0.2"         "Iris-setosa"
"2"    "4.9"          "3.0"         "1.4"          "0.2"         "Iris-setosa"
"3"    "4.7"          "3.2"         "1.3"          "0.2"         "Iris-setosa"
"4"    "4.6"          "3.1"         "1.5"          "0.2"         "Iris-setosa"
"5"    "5.0"          "3.6"         "1.4"          "0.2"         "Iris-setosa"
...    ...            ...           ...            ...           ...
"146"  "6.7"          "3.0"         "5.2"          "2.3"         "Iris-virginica"
"147"  "6.3"          "2.5"         "5.0"          "1.9"         "Iris-virginica"
"148"  "6.5"          "3.0"         "5.2"          "2.0"         "Iris-virginica"
"149"  "6.2"          "3.4"         "5.4"          "2.3"         "Iris-virginica"
"150"  "5.9"          "3.0"         "5.1"          "1.8"         "Iris-virginica"

We can see that we have six columns, the names of which we have stored in our headers vector.

&headers
["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

Let's go through them one-by-one and decide which data type will support our desired operations.

  • Id. This is an identifier that came from the original CSV file. We don't have much use for this column in our upcoming analyses, so in this case, we're going to drop this column.
  • SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. This is the multivariate data that describes each flower sample with regards to the length and width of the sepals and petals. These are numerical values with fractional parts, so we may want to store them in a floating-point data type, e.g. f32.
  • Species. This column contains the true species of the flower samples. These are categorical values, so we may wish to convert them to numerical (integer) values, e.g. u32, or keep them as strings. We'll continue with the String type for now to keep things simple.
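The Species bullet mentions integer codes as an alternative to strings. As a rough sketch of what that could look like, the block below assigns each distinct label a u32 code in order of first appearance. It works on a plain slice rather than an ndarray array, and the helper name encode_labels is hypothetical; it is not part of the original notebook or of any crate used here.

```rust
use std::collections::HashMap;

// Hypothetical helper (an assumption, not from the original notebook):
// give each distinct label an integer code in order of first appearance,
// returning the codes and the list of class names indexed by code.
fn encode_labels(labels: &[&str]) -> (Vec<u32>, Vec<String>) {
    let mut lookup: HashMap<&str, u32> = HashMap::new();
    let mut classes: Vec<String> = Vec::new();
    let mut codes: Vec<u32> = Vec::with_capacity(labels.len());

    for &label in labels {
        // The code a new label would receive: the number of classes seen so far.
        let next_code = classes.len() as u32;
        let code = *lookup.entry(label).or_insert_with(|| {
            classes.push(label.to_string());
            next_code
        });
        codes.push(code);
    }
    (codes, classes)
}

fn main() {
    let labels = [
        "Iris-setosa",
        "Iris-setosa",
        "Iris-versicolor",
        "Iris-virginica",
        "Iris-versicolor",
    ];
    let (codes, classes) = encode_labels(&labels);
    println!("{:?}", codes);
    println!("{:?}", classes);
}
```

Keeping the class names alongside the codes means the original label can always be recovered with classes[code as usize].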

Moving Data to Typed Arrays

Once we've decided which data types we want to employ, we can move on to creating our typed arrays. This involves converting values from String to the desired type and moving our data over to the new typed arrays.

The Id Column

We've decided that we don't need this column, so it requires no action.

The SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm Columns

We've decided that we want these columns to have a data type of f32, so we need to convert and move them into a new homogeneous array, this time of ndarray::Array2<f32>. We're going to achieve this by using std::str::FromStr which gives us access to the from_str() function that allows us to parse values from strings.

Let's demonstrate this approach on the first of the four columns, SepalLengthCm. We'll dump the column to an output cell to see the before and after.

data.column(1)
["5.1", "4.9", "4.7", "4.6", "5.0", "5.4", "4.6", "5.0", "4.4", "4.9", "5.4", "4.8", "4.8", "4.3", "5.8", "5.7", "5.4", "5.1", "5.7", "5.1", "5.4", "5.1", "4.6", "5.1", "4.8", "5.0", "5.0", "5.2", "5.2", "4.7", "4.8", "5.4", "5.2", "5.5", "4.9", "5.0", "5.5", "4.9", "4.4", "5.1", "5.0", "4.5", "4.4", "5.0", "5.1", "4.8", "5.1", "4.6", "5.3", "5.0", "7.0", "6.4", "6.9", "5.5", "6.5", "5.7", "6.3", "4.9", "6.6", "5.2", "5.0", "5.9", "6.0", "6.1", "5.6", "6.7", "5.6", "5.8", "6.2", "5.6", "5.9", "6.1", "6.3", "6.1", "6.4", "6.6", "6.8", "6.7", "6.0", "5.7", "5.5", "5.5", "5.8", "6.0", "5.4", "6.0", "6.7", "6.3", "5.6", "5.5", "5.5", "6.1", "5.8", "5.0", "5.6", "5.7", "5.7", "6.2", "5.1", "5.7", "6.3", "5.8", "7.1", "6.3", "6.5", "7.6", "4.9", "7.3", "6.7", "7.2", "6.5", "6.4", "6.8", "5.7", "5.8", "6.4", "6.5", "7.7", "7.7", "6.0", "6.9", "5.6", "7.7", "6.3", "6.7", "7.2", "6.2", "6.1", "6.4", "7.2", "7.4", "7.9", "6.4", "6.3", "6.1", "7.7", "6.3", "6.4", "6.0", "6.9", "6.7", "6.9", "5.8", "6.8", "6.7", "6.7", "6.3", "6.5", "6.2", "5.9"], shape=[150], strides=[6], layout=Custom (0x0), const ndim=1

It's clear from the output of data.column(1) that every element is a string. Now let's use mapv() to go through every element of data.column(1), parse each value to f32, and return the new array.

data.column(1).mapv(|elem| f32::from_str(&elem).unwrap())
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5.0, 5.9, 6.0, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6.0, 5.7, 5.5, 5.5, 5.8, 6.0, 5.4, 6.0, 6.7, 6.3, 5.6, 5.5, 5.5, 6.1, 5.8, 5.0, 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], shape=[150], strides=[1], layout=CF (0x3), const ndim=1

Looking at the output of our operations we can see that we were successful in parsing our string values into numerical ones. Let's now use this approach to create a new array named features of type ndarray::Array2<f32>. We'll need to convert our one-dimensional arrays into two-dimensional column arrays using insert_axis(Axis(1)) as we stack them.

let features: Array2::<f32> = 
    ndarray::stack![Axis(1),
        data.column(1)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(2)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(3)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1)),
        data.column(4)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];

If we don't want to copy and paste the same line of code multiple times (which we don't) we can use a loop instead. First we need to create an array of integers that identify the column indices of the features that we want to convert.

let selected_features = [1, 2, 3, 4];

We can now iterate through the array of column indices, named selected_features, and stack each converted column into a new array of type Array2::<f32>.

let mut features: Array2::<f32> =  Array2::<f32>::zeros((data.shape()[0],0));

for &f in selected_features.iter() {
    features = ndarray::stack![Axis(1), features,
        data.column(f as usize)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];
};
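The conversions above call unwrap(), which panics on the first value that fails to parse. As a minimal sketch, assuming we would rather surface the failure as a Result, the same parsing step can be written with plain slices and str::parse. The helper name parse_column is hypothetical and is not part of the original notebook or of ndarray.

```rust
use std::num::ParseFloatError;

// Hypothetical helper (an assumption, not from the original notebook):
// parse every string in a column into f32, returning the first parse
// error encountered instead of panicking.
fn parse_column(values: &[&str]) -> Result<Vec<f32>, ParseFloatError> {
    values
        .iter()
        .map(|v| v.trim().parse::<f32>())
        .collect()
}

fn main() {
    // The first few SepalLengthCm values from the dataset.
    let column = ["5.1", "4.9", "4.7"];
    match parse_column(&column) {
        Ok(parsed) => println!("{:?}", parsed),
        Err(e) => println!("could not parse column: {}", e),
    }

    // A malformed value yields an Err rather than a panic.
    let bad = ["5.1", "n/a"];
    println!("{}", parse_column(&bad).is_err());
}
```

Collecting an iterator of Result values into a single Result stops at the first error, so the caller decides how to handle a malformed cell.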

Our headers vector (6 elements describing 6 columns) no longer lines up with our new features array (4 columns), so we'll create a new headers vector, feature_headers.

let feature_headers = headers[1..5].to_vec();
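The slice 1..5 above is written separately from the selected_features indices, so the two could drift apart if the selection changes. One possible way to keep them in sync, sketched here with a hypothetical helper named select_headers (not part of the original notebook), is to build the header list from the same indices.

```rust
// Hypothetical helper (an assumption, not from the original notebook):
// pick header names by column index so that the header list always
// matches the selected feature columns. Panics if an index is out of range.
fn select_headers(headers: &[String], indices: &[usize]) -> Vec<String> {
    indices.iter().map(|&i| headers[i].clone()).collect()
}

fn main() {
    let headers: Vec<String> = [
        "Id",
        "SepalLengthCm",
        "SepalWidthCm",
        "PetalLengthCm",
        "PetalWidthCm",
        "Species",
    ]
    .iter()
    .map(|s| s.to_string())
    .collect();
    let selected_features = [1usize, 2, 3, 4];

    let feature_headers = select_headers(&headers, &selected_features);
    println!("{:?}", feature_headers);
}
```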

We now have our floating-point typed features (SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm) in a single array. Let's see how the data looks in its new form.

darn::show_frame(&features, Some(&feature_headers));
SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
5.1            3.5           1.4            0.2
4.9            3.0           1.4            0.2
4.7            3.2           1.3            0.2
4.6            3.1           1.5            0.2
5.0            3.6           1.4            0.2
...            ...           ...            ...
6.7            3.0           5.2            2.3
6.3            2.5           5.0            1.9
6.5            3.0           5.2            2.0
6.2            3.4           5.4            2.3
5.9            3.0           5.1            1.8

We're only seeing 10 samples in the array summary, so let's plot our features to get a better idea.

let layout = Layout::new()
    .xaxis(Axis::new().title(Title::new("Length (cm)")))
    .yaxis(Axis::new().title(Title::new("Width (cm)")));

let sepal = Scatter::new(features.column(0).to_vec(), features.column(1).to_vec())
    .name("Sepal")
    .mode(Mode::Markers);
let petal = Scatter::new(features.column(2).to_vec(), features.column(3).to_vec())
    .name("Petal")
    .mode(Mode::Markers);

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(sepal);
plot.add_trace(petal);

darn::show_plot(plot);
[Scatter plot: Sepal and Petal markers; x-axis: Length (cm), y-axis: Width (cm)]

Let's also check to see that we can now calculate the mean average per column.

features.mean_axis(Axis(0)).unwrap()
[5.8433347, 3.054, 3.7586665, 1.1986669], shape=[4], strides=[1], layout=CF (0x3), const ndim=1

It appears to be operating as expected. You could validate the results by running the same operation in different software.
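As a small illustration of that kind of cross-check, the sketch below computes a mean by hand on a plain slice, without ndarray. The helper name mean is hypothetical, and accumulating in f64 is a choice made here to limit rounding error, so the trailing digits may differ slightly from an f32 accumulation.

```rust
// Hypothetical helper (an assumption, not from the original notebook):
// arithmetic mean of a slice, accumulated in f64 to limit rounding error.
// Returns None for an empty slice.
fn mean(values: &[f32]) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    let sum: f64 = values.iter().map(|&v| f64::from(v)).sum();
    Some((sum / values.len() as f64) as f32)
}

fn main() {
    // First five SepalLengthCm values from the table above.
    let sepal_length = [5.1_f32, 4.9, 4.7, 4.6, 5.0];
    println!("{:?}", mean(&sepal_length));
}
```

Running the same helper over a full column and comparing against the mean_axis result with a small tolerance would give a quick sanity check.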

The Species Column

We've decided to keep these as strings. Let's index the Species column and store it in its own array named labels.

let labels: Array1::<String> = data.column(5).to_owned();

Finally, we'll check the first 10 elements of our new labels array.

labels.slice(s![0..10])
["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa"], shape=[10], strides=[1], layout=CF (0x3), const ndim=1

We could also use the itertools crate to present this output more readably.

:dep itertools = {version = "0.9.0"}
extern crate itertools;
use itertools::Itertools;

labels.slice(s![0..10]).iter().format("\n")
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"
"Iris-setosa"

Conclusion

In this section, we've demonstrated how to move parts of our raw string array into multiple arrays of various types. With this approach, we can now start operating on our data using the appropriate operations for our analyses.

