Lasagna plots in SAS: When spaghetti plots don't suffice

source link: https://blogs.sas.com/content/iml/2016/06/08/lasagna-plot-in-sas.html

Last week I discussed how to create spaghetti plots in SAS. A spaghetti plot is a type of line plot that contains many lines. Spaghetti plots are used in longitudinal studies to show trends among individual subjects, which can be patients, hospitals, companies, states, or countries. I showed ways to ease overplotting in spaghetti plots, but ultimately the plots live up to their names: When you display many individual curves the plot becomes a tangled heap of noodles that is hard to digest.

Lasagna plot of life expectancy

An alternative is to use a heat map instead of a line plot, as shown to the left. (This graph is created later in this article.) Each row of the heat map represents a subject. Each column represents a time point. Heat maps are useful when a response variable is recorded for every individual at the same set of uniformly spaced time points, such as daily, monthly, or yearly.

In a cleverly titled paper, Swihart et al. (2010) proposed the name "lasagna plot" to denote a heat map that visualizes longitudinal data. Whereas the spaghetti plot allows "noodles" (the individual curves) to cross, a lasagna plot layers the noodles horizontally, one on top of another. A related visualization is a calendar chart (Mintz, Fitz-Simons, and Wayland (1997); Allison and Zdeb (2005)), which also uses colored tiles to convey information about a response variable as a function of time.

This article shows how to create lasagna plots in SAS in two ways: with the HEATMAP statement in PROC SGPLOT and with the HEATMAPCONT subroutine in the SAS/IML language.

You can download the SAS program used to create all the graphs in this article.

Create a basic lasagna plot in SAS

In a previous article I showed how to download the World Bank data for the average life expectancy in more than 200 countries during the years 1960–2014. After the download, the data are transformed from "wide form" into "long form." The following call to PROC SGPLOT creates a spaghetti plot of the "Low Income" countries.

ods graphics / imagemap=ON;   /* enable data tips */
title "Life Expectancy at Birth";
title2 "Low-Income Countries";
proc sgplot data=LE;          /* create conventional spaghetti plot */
   where income=5;            /* extract the "low income" countries */
   format Country_Name $10.;  /* truncate country names */
   series x=Year y=Expected / group=Country_name break curvelabel
       lineattrs=(pattern=solid) tip=(Country_Name Region Year Expected);
run;

Spaghetti plot of life expectancy

This spaghetti plot is not very enlightening. There are 31 curves in the graph, although discovering that number from the graph is not easy. The labels that identify the curves overlap and are impossible to read. You can see some trends, such as the fact that life expectancy has, on average, increased for these countries. You can also see some interesting features. Cambodia (a reddish color) experienced a dip in the 1970s. Rwanda (purple) and Sierra Leone (gold) experienced dips in the 1990s. Zimbabwe (light blue) experienced a big decline in the 2000s.

A lasagna plot visualizes these data more effectively. The following statements use the HEATMAP statement in PROC SGPLOT, which requires SAS 9.40M3:

title "Life Expectancy in Low Income Countries";
/* 1. Unsorted list of low-income countries */
ods graphics/ width=500px height=600px discretemax=10000;
proc sgplot data=LE;
   where Income=5;            /* extract the "low income" countries */
   format Country_Name $10.;  /* truncate country names */
   heatmap x=Year y=Country_Name/ colorresponse=Expected discretex
               colormodel=TwoColorRamp;
   yaxis display=(nolabel) labelattrs=(size=6pt) reverse;
   xaxis display=(nolabel) labelattrs=(size=8pt) fitpolicy=thin;
run;

The graph is shown at the top of this article. Each row is a country; each column is a year. The default two-color color ramp encodes the value of the response variable, which is the average life expectancy. There are 31 x 55 = 1705 tiles, so the image displays a lot of information without any overplotting. You can use the COLORMODEL= option on the HEATMAP statement to specify a custom color ramp.
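As a hedged illustration of a custom ramp, the following sketch passes a parenthesized color list to the COLORMODEL= option instead of a named style element. The four hex colors are an assumption chosen for illustration (a white-to-red sequence in the spirit of the ColorBrewer yellow-orange-red scheme); they are not the colors used for the graph in this article.

```sas
title "Life Expectancy in Low Income Countries";
ods graphics / width=500px height=600px discretemax=10000;
proc sgplot data=LE;
   where Income=5;            /* same "low income" subset as above */
   format Country_Name $10.;  /* truncate country names */
   /* custom ramp: colors are listed from low to high response values */
   heatmap x=Year y=Country_Name / colorresponse=Expected discretex
               colormodel=(CXFFFFFF CXFED976 CXFD8D3C CXBD0026);
   yaxis display=(nolabel) reverse;
   xaxis display=(nolabel) fitpolicy=thin;
run;
```

A ramp that starts at white makes low values recede, which tends to make dips in a row easier to spot.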

Many rows have a light shade on the left and a darker shade to the right, which confirms the general upward trend of the response variable. Countries that experienced a period of decline in life expectancy (Cambodia, Rwanda, and Zimbabwe) have a region of white or light blue in the middle of the row. In some countries the life expectancy has been consistently low (Chad and Guinea-Bissau); in others, it has been consistently high (Dem. People's Republic of Korea).

The lasagna plot is not perfect. Because the graph avoids overplotting, you need a lot of space to display the rows. This lasagna plot uses 600 vertical pixels to display 31 countries. That's about 16 pixels for each row after you subtract the space above and below the heat map. If you use a smaller font, you can reduce that to 10 or 12 pixels per row. However, even at 12 pixels per row, you would need about 2500 pixels in the vertical direction to display all 207 countries in the data set. In contrast, the spaghetti plot displays an arbitrary number of (possibly undecipherable) lines in a smaller area.
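The height estimate can be checked with a short DATA step. The 100-pixel allowance for the title, axis, and legend is an assumption inferred from the numbers above (600 pixels minus 31 rows at about 16 pixels each), and the variable names are illustrative.

```sas
data _null_;
   nCountries = 207;   /* all countries in the data set */
   pxPerRow   = 12;    /* pixels per row with a small font */
   margin     = 100;   /* ASSUMED space for title, axis, and legend */
   heatmapPx  = nCountries * pxPerRow;   /* rows only: 2484 */
   totalPx    = heatmapPx + margin;      /* 2584, or roughly 2500 */
   put heatmapPx= totalPx=;              /* write both values to the log */
run;
```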

Sorting rows of a lasagna plot

Alphabetical ordering is often not the most informative way to display the levels of a categorical variable. You can sort the countries in various ways: by the mean life expectancy, by the average rate of increase (slope), by geographical region, and so forth.

To demonstrate sorting, the following program uses the HEATMAPCONT subroutine in the SAS/IML language. The following statements read in data for 51 lower-middle income countries into a SAS/IML matrix. The statements read the original "wide" data whereas PROC SGPLOT required "long" data. Use the PALETTE function to create a custom color ramp for the heat map.

ods graphics / width=500px height=650px;
proc iml;
varName = "Y1960":"Y2014";
use LL2 where (Income=4);               /* read "lower-middle income" countries */
   read all var varName into X[rowname=Country_Name]; /* X = "wide" data matrix */
close LL2;
 
Names = putc(Country_Name, "$15.");            /* truncate names */
palette = "CXFFFFFF" || palette("YLORRD", 4);  /* palette from colorbrewer.org */
 
/* 2. Order rows by average life expectancy from 1960-2014 */
mean = X[,:];                   /* compute mean for each row */
call sortndx(idx, mean, 1, 1);  /* sort by first column, descending */
Sort1 = X[idx,];                /* use sorted data */
Names1 = Names[idx,];           /* and sorted names */
call heatmapcont(Sort1) xvalues=1960:2014 yvalues=Names1
                    displayoutlines=0 colorramp=palette
                    title="Life Expectancy Sorted by Average";

Lasagna plot of life expectancy. Created with SAS/IML.

The lasagna plot shows the life expectancy for 51 countries. The countries are sorted according to mean life expectancy over the 55 years in the data.

The sorted visualization is superior to the unsorted version. You can easily pick out the countries that have the highest life expectancy, such as the former Soviet Union countries. You can also see that the countries at the bottom of the list are mainly in Western and Southern Africa.

The countries that experienced dips in life expectancy contain a patch of white or pale yellow in the middle of the row. Zambia, Kenya, and Lesotho stand out. Countries that have dramatically improved their life expectancy are also evident. For example, the rows for Bhutan and Timor-Leste are white or yellow on the left and dark orange on the right, which indicates that life expectancy has greatly improved in the past 55 years for these countries.

In this heat map, missing values are assigned a gray color. Only two countries (West Bank and Gaza, Kosovo) have missing values for life expectancy.
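Sorting by the average rate of increase, mentioned earlier as an alternative, could be sketched as follows. The fragment assumes that the PROC IML session above is still active, so that X, Names, and palette are defined. The names t, slope, idx3, Sort3, and Names3 are hypothetical, and a per-row least-squares slope is only one way to define an average rate of increase.

```sas
/* Order rows by the least-squares slope of life expectancy versus year */
t = T(1960:2014);                      /* column vector of years */
slope = j(nrow(X), 1, .);              /* one slope per country; missing by default */
do i = 1 to nrow(X);
   y = T(X[i,]);                       /* i_th row as a column vector */
   k = loc(y ^= .);                    /* indices of nonmissing years */
   if ncol(k) > 1 then do;             /* need at least two points for a slope */
      tc = t[k] - mean(t[k]);          /* centered years */
      slope[i] = sum(tc # y[k]) / ssq(tc);   /* OLS slope estimate */
   end;
end;
call sortndx(idx3, slope, 1, 1);       /* sort by slope, descending */
Sort3  = X[idx3,];                     /* use sorted data */
Names3 = Names[idx3,];                 /* and sorted names */
call heatmapcont(Sort3) xvalues=1960:2014 yvalues=Names3
                    displayoutlines=0 colorramp=palette
                    title="Life Expectancy Sorted by Slope";
```

With this ordering, countries whose life expectancy rose fastest appear at the top, regardless of their overall level.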

Other ways to sort lasagna plots

Swihart et al. (2010) discuss other sorting techniques. One alternative display is to sort each column independently. This is similar to displaying a sequence of box plots, one for each year, because it shows the distribution of the response variable for each year, aggregated over all countries. This display is accomplished by using the following SAS/IML statements to sort each column in the data matrix:

/* 3. Order each year to see distribution of life expectancy for each year */
Sort2 = X;                   /* copy original data */
do i = 1 to ncol(X);
   v = X[,i];                /* extract i_th column */
   call sort(v, 1, 1);       /* sort i_th column descending */
   Sort2[,i] = v;            /* put sorted column into matrix */ 
end;
call heatmapcont(Sort2) xvalues=1960:2014
                    displayoutlines=0 colorramp=palette
                    title="Life Expectancy Sorted for each Year";

Lasagna plot in which each year is sorted independently. Created with SAS/IML.

In this graph, each tile still represents a country, but you don't know which country. The purpose of the graph is to show how the distribution of the response variable has changed over time. You can see that in 1960 most countries had a life expectancy of 55 years or less, as shown by the vertical strip of mostly yellow tiles for 1960. Only one country in 1960 had an average life expectancy that approached 70 years (red).

Decades later, the vertical strips contain fewer yellow tiles and more orange and red tiles. The strip for 1990 contains mainly orange or dark orange tiles, which indicates that most countries have an average life expectancy of 55 years or greater. For 2014, more than half of the vertical strip is dark orange or red, which indicates that most countries have an average life expectancy of 65 years or greater.
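For comparison, the sequence of box plots that this display resembles could be drawn from the long-form data. This is only a sketch: it assumes that the LE data set contains the lower-middle income countries under the same coding (Income=4) that the wide data set uses.

```sas
title "Distribution of Life Expectancy by Year";
title2 "Lower-Middle Income Countries";
proc sgplot data=LE;
   where Income=4;                   /* ASSUMED: same income coding as the wide data */
   vbox Expected / category=Year;    /* one box plot per year */
   xaxis display=(nolabel) fitpolicy=thin;
   yaxis label="Life Expectancy (years)";
run;
```

The box plots summarize each year by quartiles, whereas the column-sorted lasagna plot shows every country's value as a tile.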

Summary

A "lasagna plot" is a heat map in which each row represents a subject such as a country, a patient, or a company. The columns usually represent time, and the colors for each tile indicate the value of a response variable. The lasagna plot is useful when the response is measured for all subjects at the same set of uniformly spaced time points.

In contrast to spaghetti plots, there is no overplotting in a lasagna plot. However, each subject requires some minimal number of vertical pixels (usually 10 or more), so lasagna plots are not suitable for visualizing thousands of subjects. Lasagna plots are most effective when you sort the subjects by some characteristic, such as the mean response value.

Much more can be said about lasagna plots. I recommend the paper "A Method for Visualizing Multivariate Time Series Data," (Peng, JSS, 2008), which discusses techniques for standardizing the data, smoothing the data, and discretizing the response variable. Peng's article includes panel displays, which can be created in SAS by using the GTL and PROC SGRENDER.
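As one hedged example of standardizing, the following fragment centers each country's series at its own mean before plotting, so that color shows change relative to that country's average. It assumes that the PROC IML session from earlier is still active (X, Names, and palette defined). The name Centered is hypothetical, and this is a simple illustration rather than the specific procedure in Peng's paper.

```sas
/* Center each row at its own mean; missing values stay missing */
Centered = X - X[,:];        /* subtract each row mean from that row */
call heatmapcont(Centered) xvalues=1960:2014 yvalues=Names
                    displayoutlines=0 colorramp=palette
                    title="Life Expectancy Relative to Each Country's Mean";
```

Because centered values are both negative and positive, a diverging color ramp would probably be a more natural choice than the sequential ramp reused here.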

