
The ultimate EDA visualization in R

 5 years ago
source link: https://www.tuicool.com/articles/zEnQrm7


Sep 3 ·7min read

Introduction

Recently I was doing exploratory data analysis (EDA) for a research project, and a colleague introduced me to the rain cloud plot. It combines a dot plot, a box plot, and a kernel density (you can think of the density as half a violin plot), and it conveys a great deal of information in a visually pleasant way (which is why we often mistakenly call it a rainbow plot :) ). A sample plot is shown below:


Rain cloud plot for a comparative politics project

It looks amazing, doesn't it? I was astonished by the final look of the plot, even though I made it myself!

Please note that this post builds on the great work of Paula Andrea Martinez, Dr. Micah Allen, and David Robinson. I will therefore not reinvent the wheel by walking through every detail of making the plot; instead, I will introduce some tricks you might need to make such a plot suit your needs.

Preparation

I will use the dataset from the World Values Survey Wave 5 as an example. I have cleaned the dataset, which you can simply git clone from here. A glimpse of the dataset:

[Image: preview of the dataset]

The country column has 22 unique country values:

Argentina, Australia, Brazil, Bulgaria, Canada, Chile, Finland, France, Georgia, Hungary, Japan, Mexico, Netherlands, Norway, Poland, Romania, Serbia, Slovenia, Sweden, United Kingdom, United States, and Uruguay.

The remaining columns hold respondents' answers to 6 questions about the essential characteristics of democracy, each on a 1–10 scale. For each of these variables, the question prompt looked like this:

Many things may be desirable, but not all of them are essential characteristics of democracy. Please tell me for each of the following things how essential you think it is as a characteristic of democracy.

(1) Government taxes rich to subsidize poor;

(2) Religious authorities interpret laws;

Use this scale where 1 means ‘not at all an essential characteristic of democracy’ and 10 means it definitely is ‘an essential characteristic of democracy.’

Another thing you need in order to create the plot is the R package RColorBrewer .

RColorBrewer is an R package that allows users to create colourful graphs with pre-made color palettes that visualize data in a clear and distinguishable manner. There are 3 categories of palettes: qualitative, diverging, and sequential.
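To see how many colors each palette actually provides, you can inspect the brewer.pal.info table that ships with the package. This is only a minimal sketch, assuming RColorBrewer is installed; the variable names (info, max_by_category) are my own and purely illustrative.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette;
# maxcolors is the largest number of colors that palette defines.
info <- brewer.pal.info
info$palette <- rownames(info)

# The two palettes used in this post.
print(info[c("Spectral", "Set2"), c("palette", "category", "maxcolors")])

# Largest palette size within each category (div, qual, seq).
max_by_category <- tapply(info$maxcolors, info$category, max)
print(max_by_category)
```

No built-in palette defines more than 12 colors, which matters as soon as the number of groups in a plot exceeds that.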

Problem 1

The code sample provided in the posts linked above works most of the time, as long as your dataset does not feature many groups (in this case, countries). However, it breaks once you have “many” groups. The cutoff is around 8–12 groups, depending on the color palette you choose in the following code.

# borrowed from Paula Andrea Martinez's post mentioned above
# geom_flat_violin(), raincloud_theme and the summary data frame sumld
# are defined in that post and are not part of ggplot2 itself
library(ggplot2)

g <-
  ggplot(data = name_of_your_data,
         aes(x = EmotionCondition, y = Sensitivity, fill = EmotionCondition)) +
  geom_flat_violin(position = position_nudge(x = .2, y = 0), alpha = .8) +
  geom_point(aes(y = Sensitivity, color = EmotionCondition),
             position = position_jitter(width = .15), size = .5, alpha = 0.8) +
  geom_point(data = sumld, aes(x = EmotionCondition, y = mean),
             position = position_nudge(x = 0.3), size = 2.5) +
  geom_errorbar(data = sumld, aes(ymin = lower, ymax = upper, y = mean),
                position = position_nudge(x = 0.3), width = 0) +
  expand_limits(x = 5.25) +
  guides(fill = "none") +   # guides(fill = FALSE) is deprecated in recent ggplot2
  guides(color = "none") +
  coord_flip() +
  scale_color_brewer(palette = "Spectral") +
  scale_fill_brewer(palette = "Spectral") +
  theme_bw() +
  raincloud_theme

So if you have a dataset like our example here, as you often will in real analytical settings, the R code above breaks and yields this obnoxious error:

> g
Error: Insufficient values in manual scale. 22 needed but only 1 provided.

You see this because ggplot cannot fetch enough colors from the chosen palette (“Spectral” in this case) for our 22 groups/countries: the default palettes only provide 8–12 colors each. We therefore need to “cut” the palette into smaller intervals, creating enough colors for all of our groups. We can do so by interpolating the default palette into a new one with the colorRampPalette function:

getPalette = colorRampPalette(RColorBrewer::brewer.pal(8, "Set2"))(22)  # I set (26) when creating the plot above for better transition of colors.

This cuts our original palette into 22 small intervals and extracts the corresponding colors. I have tried some of the default palettes and I like the palette “Set2” the most. I actually created 26 colors rather than 22 because this allows the last few groups to have colors that are brighter and more saturated.

Try different palettes and color combinations until you find the ones that please you the most :)
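The colorRampPalette step described above can be broken into smaller pieces, which makes it easier to see what each call returns. This is a sketch only, assuming RColorBrewer is installed; n_groups, make_palette, country_cols, and the placeholder countries vector are illustrative names and not part of the original code.

```r
library(RColorBrewer)

n_groups  <- 22                              # number of countries in this post
base_cols <- brewer.pal(8, "Set2")           # the 8 colors that Set2 defines

# colorRampPalette() returns a function; calling it with n yields n
# hex colors interpolated along the base colors.
make_palette <- colorRampPalette(base_cols)
country_cols <- make_palette(n_groups)
length(country_cols)

# Optional: name the colors so each group keeps its color regardless of
# plotting order. `countries` is a hypothetical stand-in for something like
# sort(unique(name_of_your_data$country)).
countries <- paste0("country_", seq_len(n_groups))
names(country_cols) <- countries
```

A vector built this way can be passed to scale_fill_manual(values = country_cols) and scale_color_manual(values = country_cols). When the vector is named, ggplot2 matches the colors to the factor levels by name.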

Problem 2

This is a more subtle problem, and it only becomes evident if you have skewed data. It is subtle because the code will not complain about anything and will generate the colorful plots as requested. However, after you put them into production and report them, you may be asked “why don't they look like they are on the same scale?”

Let me show you an example. Please compare the following figures with tax as the plotted variable in the first figure and civil_rights in the second.


Figure 1

Figure 1 looks as expected and we see a nice depiction of the spread of our data.


Figure 2

Figure 2 looks bizarre, doesn't it? The density plots look thin and squashed, and they do not seem to be on the same scale as Figure 1. Yet the code that generates both figures is exactly the same, and it raises no error at all!

The problem lies in the distribution of the data and the specifics of the density plot. The civil_rights variable in Figure 2 is far more skewed towards the top of the scale than the tax variable: more people consider the protection of civil rights a definitely essential characteristic of democracy than consider “Government taxes rich to subsidize poor” one. The median of civil_rights for certain countries is almost 9. With this much skew, the estimated density shoots past the upper bound of 10 and extends as far as 12. However, the rain cloud plot by default trims the density at 10, so the area beyond 10 is lost, which makes the density plot look squashed and compressed.

So far I have found no good way to add the lost area back at the top value of 10. However, after some research I found a parameter in the rain cloud plot that helps with interpreting the plot.
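The overshoot is easy to reproduce outside the plot. The sketch below uses base R's density() on made-up responses (not the survey data, and not the rain cloud code itself) to show how a kernel density estimate on a bounded, top-heavy scale spills past the upper bound.

```r
# Hypothetical responses on a 1-10 scale, heavily skewed toward 10.
x <- rep(1:10, times = c(1, 1, 1, 2, 2, 3, 5, 10, 20, 30))

# By default density() extends the estimate 3 bandwidths past the data
# range (cut = 3), so the curve runs beyond 10 although no answer exceeds 10.
d <- density(x)
range(d$x)

# Approximate share of the estimated area that lies above 10
# (Riemann sum over the evenly spaced grid returned by density()).
step <- diff(d$x)[1]
area_above_10 <- sum(d$y[d$x > 10]) * step
area_above_10
```

Whatever share ends up above 10 is what disappears when the density is trimmed at the data range, and that share differs from group to group.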

# getPalette is the color vector created in Problem 1
g <-
  ggplot(data = name_of_your_data,
         aes(x = factor(country), y = tax, fill = factor(country))) +
  geom_flat_violin(position = position_nudge(x = .2, y = 0), trim = TRUE,
                   alpha = .8, scale = "width") +
  geom_point(aes(y = tax, color = factor(country)),
             position = position_jitter(width = .15), size = .5, alpha = 0.8) +
  geom_boxplot(width = .1, outlier.shape = NA, alpha = 0.5) +
  geom_point(data = sumld, aes(x = factor(country), y = mean),
             position = position_nudge(x = 0.3), size = 2.5) +
  geom_errorbar(data = sumld, aes(ymin = lower, ymax = upper, y = mean),
                position = position_nudge(x = 0.3), width = 0) +
  expand_limits(x = 5.25) +
  guides(fill = "none") +   # guides(fill = FALSE) is deprecated in recent ggplot2
  guides(color = "none") +
  scale_color_manual(values = getPalette) +
  scale_fill_manual(values = getPalette) +
  # coord_flip() + # flip or not
  theme_bw() +
  raincloud_theme +
  theme(axis.title = element_text(size = 42),
        axis.text = element_text(size = 42))

The key here is the parameter scale = "width". It is documented here, and according to the documentation:

if “area” (default), all violins have the same area (before trimming the tails). If “count”, areas are scaled proportionally to the number of observations. If “width”, all violins have the same maximum width.
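To make the difference between the options concrete, here is a small sketch with made-up data. It uses ggplot2's own geom_violin(), which offers the same scale options, rather than geom_flat_violin(). The toy data frame and all variable names are assumptions for illustration only.

```r
library(ggplot2)

set.seed(1)
# Hypothetical 1-10 responses: group "a" is spread out,
# group "b" piles up near 10.
toy <- data.frame(
  group = rep(c("a", "b"), each = 200),
  value = c(sample(1:10, 200, replace = TRUE),
            sample(1:10, 200, replace = TRUE,
                   prob = c(rep(1, 7), 5, 10, 30)))
)

p_area  <- ggplot(toy, aes(x = group, y = value)) +
  geom_violin(trim = TRUE, scale = "area")
p_width <- ggplot(toy, aes(x = group, y = value)) +
  geom_violin(trim = TRUE, scale = "width")

# layer_data() exposes the violin widths ggplot2 computed for each group.
d_area  <- layer_data(p_area)
d_width <- layer_data(p_width)
w_area  <- tapply(d_area$violinwidth,  d_area$group,  max)
w_width <- tapply(d_width$violinwidth, d_width$group, max)
w_area
w_width
```

With scale = "width" every group's widest point is drawn at the same width, so the shape within each group stays readable even when one group is far more concentrated than another.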

If we set scale = "width" this is what those two figures would look like:


Figure 3: Adjusted plot for the tax variable


Figure 4: Adjusted plot for the civil_rights variable

Both plots, especially Figure 4, are no longer as compressed as Figure 2, and the distribution of the variable WITHIN each country becomes more evident. This is exactly what we want here: to understand the distribution WITHIN each country. One caveat is that, since we set the maximum density width to be the same, the areas of the densities are no longer equal (although they were not equal in Figure 2 either, because the trimmed-out area is not constant across countries). This drawback is especially evident for Sweden in Figure 4. However, given that our goal is to view the distribution WITHIN rather than BETWEEN countries, I think the adjusted version works better for us here.

Conclusion

I hope this article clears some of the obstacles you may encounter when making the amazing rain cloud plot. I can envision it becoming more widely used in academia and industry for its elegance in condensing a large amount of information.

A final note: implementations of the rain cloud plot in Python and Matlab can also be found here.

I hope you find this article useful. Please feel free to let me know if you have further questions about rain cloud plots :)

Happy coding!

