Radial Stacked Bar Chart for the average life expectancy across the US states

By Mike Bostock via Observable (GPL-3.0-only)

Why interactive visualization is important

We perceive most information visually. We can instantly recognize and localize objects, and we group them into the same category regardless of their shape, size, color, distance, or whether the scene is cluttered. When we try to understand a concept, it is usually easier to make a sketch. This is due to the immense sophistication of the human visual cortex: it is in our nature to comfortably consume and understand visuals.

The same principles hold when we interpret tabular and highly structured data. We draw 2D or 3D plots, histograms, scatter plots, heat maps, etc. We do this because the tabular representation of the data is not straightforward to understand. On top of this, to make the experience more immersive, we make the plots interactive. However, we can perceive information in at most three dimensions. Representing data with more than three dimensions is a challenging task that requires different strategies.

For this reason, in this blog post, I will show techniques for going beyond three dimensions and combining them with interactivity for even better perception, using D3. Stay tuned!

The visualization zoo

We can better understand the world and convey a clear message by plotting the data we have, because a plot is the most convenient and intuitive means to do that. In addition, we can easily upgrade the plots and make them interactive. In this way, the human-computer interaction is more immersive and the results are more interpretable. For instance, just take a look at the amazing interactive plots from Our World in Data. With a simple yet unbiased and profound analysis and interactive visualization of many open-access data sets, we can clearly comprehend a plethora of the world’s phenomena.

Depending on the aspects and the findings to express, we use different types of plots. The most commonly used plots are nicely summarized in the paper A Tour Through the Visualization Zoo [1]. Generally, they are divided into five categories: Time-Series, Statistical Distributions, Maps, Hierarchies, and Networks. To find out the most recent developments and applications of these visualization techniques, follow the work of the Interactive Data Lab at the University of Washington.

Fig. 1. Screenshot from the D3 gallery

There are plenty of tools to generate interactive visualizations. They range from very low-level, customizable tools that are hard to automate to high-level automatic tools. In the open-source domain, we have the well-established low-level JavaScript library D3. The main advantage of D3 is that we can create very specific and custom plots. Furthermore, we have the higher-level visualization tools Vega and Vega-Lite, both open-source. Vega is built on top of D3 and aims to provide a better and faster way to create high-quality graphics using only JSON syntax. Vega-Lite is even more lightweight and enables quick production of the common statistical plots. On top of this ecosystem sits Voyager, a tool that automates the generation of interactive charts. Voyager is inspired by Tableau, a commercial tool widely adopted in industry. This list is of course not exhaustive. There are numerous other tools and libraries, which I include in the Appendix.

Plotting high-dimensional data

Plotting and understanding 2D and 3D data is simple since we live in a three-dimensional world. For this purpose, we use histograms to plot distributions and line charts to draw relations. Going beyond three dimensions is difficult. To do so, we can use different shapes, sizes, colors, and text if some of the data is discrete and finite. For instance, we can use a Bubble Chart, like the one depicted in the figure below.

Bubble chart for the CO2 emissions per capita vs GDP per capita across the countries in the world

Fig. 2. By Hannah Ritchie and Max Roser via Our World in Data (CC BY 4.0).

In fact, we have five-dimensional data consisting of three numeric dimensions (GDP per capita, CO2 emissions per capita, and population) and two categorical dimensions (country and geo-location). GDP and CO2 emissions per capita are placed on the two axes. With a simple trick, we represent each country as a circle labeled with the country name, the total population as the size of the circle, and the geo-location as a color scheme.

So far so good! However, this type of plotting has its limitations. We can’t represent a considerably high number of dimensions by using these tricks, especially if they are all continuous. For this reason, we have to use some other means to communicate the data. As mentioned in the paper A Tour Through the Visualization Zoo, we can use a plot called Parallel Coordinates.

The Parallel Coordinates plot enables us to explore and find patterns in high-dimensional data. Each dimension is represented as a vertical line parallel to the others where the range of values is distributed. Thus, one point in the high-dimensional space is represented as a poly-line connecting the corresponding values on the parallel axes. Additionally, we can select a subset of values in one or multiple axes to filter out the entries. An example is shown in the figure below, which is a visualization of car characteristics.

Parallel Coordinates plot representing the car prices

Fig. 3. By Jason Davies via Blocks (GPL-3.0-only)

Hands-on Parallel Coordinates with D3

The main goal of this post is to demonstrate the effectiveness of interactive visualization, in particular the Parallel Coordinates plot. To that end, we will show how to give a visual interpretation of a concrete problem.

Problem Statement

Suppose we want to buy a laptop with certain characteristics that suit our needs well. Usually, we do not have a clue about all the possible options. To read the specifications and compare the different offers, we might end up with a huge table that resembles the one below:

A table summarizing different laptops and their prices

Table 1: Tabular representation of the Laptop Prices data set.

Although it is nicely formatted and summarizes the laptops neatly, we might easily get lost once we scale to a few hundred rows. There are 12 different axes to compare at the same time, which are hard to keep track of. Instead, we can visualize all entries in the table using the Parallel Coordinates plot and query them interactively.

For the purpose of the demo, we use the Laptop Prices data set from Kaggle, which you can find here. It contains 1300 entries with the following 12 columns: Company, Product, Type, Inches, Screen Resolution, CPU, RAM, Memory, GPU, Operating System, Weight, Price.

First, we pre-process the data set in order to clean it and adjust it for plotting, using this Jupyter Notebook. We split the column Memory into two columns, SSD Memory and HDD Memory, and express the quantities only in Gigabytes. Similarly, we split the column CPU into two other columns, CPU Model Name and CPU Clock Rate. The former contains the model of the CPU, while the latter contains its clock rate expressed in GHz. For the GPU column, we only take the part containing the model name. Furthermore, the values in the RAM column contain the suffix GB, which is redundant, so we remove it. Finally, for the Screen Resolution column, we only take the part containing the screen resolution in pixels, discarding the rest of the description. In the end, the data set format is like the one shown in Table 1.
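The pre-processing steps above can be sketched in plain Python. This is only an illustration and not the code from the notebook: the raw value formats (for example "128GB SSD + 1TB HDD" or "Intel Core i5 2.3GHz"), the helper names, and the conversion of 1 TB to 1000 GB are all assumptions made for the example. Storage kinds other than SSD and HDD are simply ignored here.

```python
import re


def parse_memory(memory: str) -> tuple:
    """Return (ssd_gb, hdd_gb) from a string such as '128GB SSD + 1TB HDD'."""
    ssd_gb = hdd_gb = 0.0
    pattern = r"(\d+(?:\.\d+)?)\s*(GB|TB)\s+(SSD|HDD)"
    for amount, unit, kind in re.findall(pattern, memory):
        # Assumption: 1 TB is counted as 1000 GB.
        gigabytes = float(amount) * (1000 if unit == "TB" else 1)
        if kind == "SSD":
            ssd_gb += gigabytes
        else:
            hdd_gb += gigabytes
    return ssd_gb, hdd_gb


def parse_cpu(cpu: str) -> tuple:
    """Split 'Intel Core i5 2.3GHz' into (model name, clock rate in GHz)."""
    match = re.match(r"^(.*?)\s*(\d+(?:\.\d+)?)\s*GHz$", cpu.strip())
    if match is None:
        return cpu.strip(), None  # no clock rate found, keep the text as is
    return match.group(1), float(match.group(2))


def parse_ram(ram: str) -> int:
    """Drop the redundant 'GB' suffix: '8GB' becomes 8."""
    return int(re.fullmatch(r"(\d+)\s*GB", ram.strip()).group(1))


def parse_resolution(description: str) -> str:
    """Keep only the pixel resolution, e.g. '2560x1600'."""
    return re.search(r"\d+x\d+", description).group(0)


def clean_row(row: dict) -> dict:
    """Apply all of the transformations to one raw record."""
    ssd_gb, hdd_gb = parse_memory(row["Memory"])
    model, clock = parse_cpu(row["CPU"])
    return {
        "SSD Memory": ssd_gb,
        "HDD Memory": hdd_gb,
        "CPU Model Name": model,
        "CPU Clock Rate": clock,
        "RAM": parse_ram(row["RAM"]),
        "Screen Resolution": parse_resolution(row["Screen Resolution"]),
    }
```

Calling clean_row on a record with the four raw columns returns the split and cleaned values. The remaining columns would be copied over unchanged.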

Using the transformed data set, we create the Parallel Coordinates plot. For convenience, we only use 10 columns, assigning each one a vertical axis in the plot. Every axis has a description on top of it and a range of values that depends on whether it holds numeric or string data. In the case of numeric data, the range is from the minimum to the maximum, separated with equidistant ticks ascending from bottom to top. In the case of strings, the range is from the first to the last alphabetically sorted string, each represented with one tick. Thus, one laptop is fully specified by one poly-line stretching from the leftmost to the rightmost axis. The poly-lines are colored according to the Company value in order to distinguish them easily.
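To make the axis construction concrete, here is a minimal Python sketch of how each laptop could be turned into a poly-line. It is not the D3 code behind the demo. The function names, the axis height of 100 units, the gap of 50 units between axes, and the choice to place the alphabetically first string at the bottom are assumptions made for illustration. Positions are measured upward from the bottom of an axis.

```python
def numeric_scale(values, height=100.0):
    """Linear scale: the minimum maps to 0 (bottom), the maximum to height (top)."""
    low, high = min(values), max(values)
    span = high - low

    def scale(value):
        return 0.0 if span == 0 else (value - low) / span * height

    return scale


def string_scale(values, height=100.0):
    """One evenly spaced tick per distinct string, in alphabetical order."""
    categories = sorted(set(values))
    step = height / (len(categories) - 1) if len(categories) > 1 else 0.0
    positions = {name: index * step for index, name in enumerate(categories)}
    return positions.__getitem__


def build_polylines(rows, columns, axis_gap=50.0, height=100.0):
    """Return one list of (x, y) points per row, one point per axis."""
    scales = {}
    for column in columns:
        values = [row[column] for row in rows]
        if all(isinstance(value, (int, float)) for value in values):
            scales[column] = numeric_scale(values, height)
        else:
            scales[column] = string_scale(values, height)
    return [
        [(index * axis_gap, scales[column](row[column]))
         for index, column in enumerate(columns)]
        for row in rows
    ]
```

Each returned list of points is one poly-line. In a drawing library, the points would be connected in order from the first axis to the last.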

The plot is interactive: on each axis, we can select a subset of values in which we are interested by dragging the mouse pointer over it. In addition, we can slide this range along the axis. Entries outside the selection are filtered out automatically. To support the search, the table below the plot dynamically updates with the 5 cheapest laptops in the current selection. To reset the selection, click outside the selected range.

Now, the laptop search is much easier and more intuitive. We can set different constraints for every column, notice the changes, and compare the fewer remaining options. Below you can find a video illustration of how it works. To try it out yourself, please visit the original blog post.
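The filtering logic behind this interaction can be sketched in a few lines of Python. This is not the D3 brushing code of the demo. The names in_selection and cheapest, and the representation of a selection as a mapping from a column name to an inclusive (low, high) range, are assumptions made for illustration.

```python
def in_selection(row, brushes):
    """True if the row lies inside every selected range.

    brushes maps a column name to an inclusive (low, high) pair.
    An empty mapping means nothing is selected, so every row passes.
    """
    return all(low <= row[column] <= high
               for column, (low, high) in brushes.items())


def cheapest(rows, brushes, limit=5, price_column="Price"):
    """Return the `limit` cheapest rows among those inside the selection."""
    selected = [row for row in rows if in_selection(row, brushes)]
    return sorted(selected, key=lambda row: row[price_column])[:limit]
```

For example, cheapest(rows, {"RAM": (8, 16), "Weight": (1.0, 1.8)}) would return at most five laptops with 8 to 16 GB of RAM and a weight between 1.0 and 1.8, ordered by price. Because Python compares strings alphabetically, the same range check also works on string axes whose ticks are sorted alphabetically.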

Demo animation of the parallel coordinates plot

Demo of the Parallel Coordinates plot with a dynamic table

The main advantage of this plot is that we can intuitively plot and understand multi-dimensional data. However, if the number of dimensions is very large, we can’t plot all of them, since the plot will be too dense and confusing. Moreover, it is not well suited for categorical data: the number of categories must be sufficiently small in order to place them all on one of the axes.

The full code to run this interactive visualization can be found here. For more information follow me on Twitter.

Conclusion

In this blog post, we saw why interactive visualization is important for understanding our data better and making our searches faster. There are many different types of plots depending on the nature of the data and the task to perform. One of them is the Parallel Coordinates plot, which enables us to scale easily to more than 3 dimensions. Out of the many existing tools for plotting, we used D3 for a hands-on experience, creating an interactive Parallel Coordinates plot augmented with a dynamic table for the laptop search problem. This plot has many potential uses, and next time we will see how to apply it in the domain of Machine Learning.

Appendix

JavaScript libraries

The following list includes popular open-source JavaScript libraries and tools for visualization:

Chart.js: it offers 8 fully responsive charts for mobile and web developers
Crossfilter: a library optimized for loading and exploring huge data sets, with millions of entries
DC.js: used to create multiple charts on the same data set that update dynamically together
Plotly.js: built on top of D3, it offers more than 40 charts
Cola.js: for plotting graph-based data
Leaflet: for plotting mobile-friendly interactive maps
MetricsGraphics.js: optimized for plotting time-series data

Python Libraries

The following list includes popular open-source Python libraries and tools for visualization:

Matplotlib: mostly used for scientific plotting
Seaborn: based on Matplotlib for drawing more appealing charts
Bokeh: interactive visualizations mainly used for web applications
Altair: a Python variant of the Vega-Lite visualization grammar
Plotly: the Python wrapper of Plotly.js

Learning Resources

The following list contains an awesome set of resources to learn to plot with D3:

References

[1] Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, “A Tour Through the Visualization Zoo” (2010), Communications of the ACM, Vol. 53, No. 6

JavaScript Interactive Visualization | Analytics Vidhya
