Comparing Survey Results with Alluvial Diagrams

[Figure: breakfast alluvial diagram]

Alluvial Diagrams

Introduction

Raw survey results can be useful, but for some projects it helps to see how each respondent’s answers to multiple questions relate to one another. Let’s imagine we asked a group of people two questions about breakfast options: how much they like eggs and how much they like ham.

How interested are you in eggs?   Count of Responses
(1) Not interested                4
(2) Not very interested           7
(3) Somewhat interested           6
(4) Pretty interested             10
(5) Quite interested              4

[Figure: chart of responses to “How interested are you in eggs?”]

How interested are you in ham?    Count of Responses
(1) Not interested                5
(2) Not very interested           6
(3) Somewhat interested           8
(4) Pretty interested             7
(5) Quite interested              5

[Figure: chart of responses to “How interested are you in ham?”]

These two charts are fairly similar, but just looking at this data, we don’t know about any connection between the answers. Here are some questions we currently can’t answer:

  1. Do people who like eggs also like ham?

  2. Do people who like eggs hate ham? (or vice versa)

Ham and eggs are a silly example, so the importance may not be immediately obvious, but consider if this data were instead for a few app features: it would be important to know whether users felt the same way about being contacted through push notifications and email digests, or whether the email lovers tended to hate push notifications.

One way to visualize the correlations between the survey answers is with the style of Sankey diagram known as an alluvial diagram. If the survey results come from a survey program which assigns anonymous userIDs to survey respondents (many do), then we can group the data based on each user’s answer and determine the relationships between how each person answers a series of questions.

For the charts in this post we’ll use R, with the addition of the ggalluvial library (plus the general tidyverse components). In order to facilitate that, it’ll be best to have the data in three columns:

  • one column for the Eggs ranking (out of 5)

  • one column for the corresponding Ham ranking

  • one for the frequency at which this Eggs/Ham pair of ranks appears in the data (e.g. “How many times did someone rate eggs as 1/5 and rate ham as 5/5?”)

If you’re good with R, you can probably use dplyr or something similar to transform the data; if you’re not as familiar with R and your data set is relatively small, you could just do it by hand if necessary. I don’t have a great tutorial for this because each survey program I’ve worked with seems to have its own way of reporting user-based response statistics. Let’s assume that our data from above looks like the following table when transformed.

(Recall that “freq” indicates how many times this particular combination of answers appeared in the dataset)

With R’s ggalluvial library, we can use this to create a chart.

Eggs                      Ham                       freq
5 (Quite interested)      1 (Not interested)        3
5 (Quite interested)      2 (Not very interested)   1
4 (Pretty interested)     1 (Not interested)        1
4 (Pretty interested)     2 (Not very interested)   4
4 (Pretty interested)     3 (Somewhat interested)   5
4 (Pretty interested)     5 (Quite interested)      1
3 (Somewhat interested)   2 (Not very interested)   1
3 (Somewhat interested)   3 (Somewhat interested)   2
3 (Somewhat interested)   4 (Pretty interested)     3
2 (Not very interested)   5 (Quite interested)      1
2 (Not very interested)   3 (Somewhat interested)   1
2 (Not very interested)   4 (Pretty interested)     3
2 (Not very interested)   5 (Quite interested)      2
1 (Not interested)        4 (Pretty interested)     1
1 (Not interested)        5 (Quite interested)      2
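
If your survey tool exports one row per respondent, the pair counts can be built with dplyr rather than by hand. The following is a minimal sketch, not a general recipe: the data frame responses and its columns respondent_id, Eggs, and Ham are hypothetical stand-ins for whatever your export actually contains, and the ratings are plain numbers rather than the labeled text in the table above.

```r
library(dplyr)

# Hypothetical per-respondent export: one row per person, one column per question.
responses <- data.frame(
  respondent_id = 1:6,
  Eggs = c(5, 5, 4, 4, 4, 1),
  Ham  = c(1, 1, 2, 2, 3, 5)
)

# Count how often each Eggs/Ham combination occurs and store it as "freq".
pair_counts <- responses %>%
  count(Eggs, Ham, name = "freq")

print(pair_counts)
```

The result has the same three-column shape (Eggs, Ham, freq) as the table above, so it could be saved with write.csv and used in the steps that follow.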

Working with R

If you’ve never used R before, install R from https://cran.rstudio.com/ and then, optionally, RStudio as your IDE from https://www.rstudio.com/products/rstudio/download/#download. To install and load the packages needed for an alluvial diagram, run the following commands:

install.packages('tidyverse')
install.packages('ggalluvial')
library(tidyverse)
library(ggalluvial)

More info about ggalluvial is available at https://cran.r-project.org/web/packages/ggalluvial/vignettes/ggalluvial.html and more details about the functions therein can be found at https://www.rdocumentation.org/packages/ggalluvial/versions/0.9.0 (or any newer version when they release).

With the tidyverse and ggalluvial libraries loaded up, you can import your data. Again, if not familiar with R, one easy way to get data set up is to read it from a CSV. If our table is in EggsAndHam.csv, then we can use the following:

breakfastData <- read.csv('EggsAndHam.csv')

This should create a data frame which you can see in RStudio. From there, you can generate a graph with the following code:

# Written for a recent ggalluvial release. Older versions (such as 0.9.0) used
# weight = freq in place of y = freq, and label.strata = TRUE in geom_text.
ggplot(breakfastData,
       aes(y = freq, axis1 = Eggs, axis2 = Ham)) +
    # Flow lines between the two questions, colored by the Ham answer
    geom_alluvium(aes(fill = Ham, color = Ham),
                  width = 1/12, alpha = 0.8, knot.pos = 0.4) +
    # Boxes (strata) for each answer on either side
    geom_stratum(width = 1/4, color = "grey") +
    scale_fill_manual(values  = c("darkred", "darkorange4", "darkgoldenrod4", "darkolivegreen4", "cadetblue4")) +
    scale_color_manual(values = c("darkred", "darkorange4", "darkgoldenrod4", "darkolivegreen4", "cadetblue4")) +
    # Label each stratum with its answer text
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_x_continuous(breaks = 1:2, labels = c("Eggs", "Ham")) +
    theme_minimal() +
    theme(
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 14, face = "bold")
    )


This should give you a chart with the flow between egg responses and ham responses. In this example, we can see there’s an inverse correlation between green eggs and ham! People who like eggs tend to not like ham, and people who like ham tend to not like eggs.

To interpret the chart, here are a few details that can be helpful to explain while presenting the data:

  • The size of each box in the strata on either edge of the chart represents the number of survey responses with that answer.

  • Similarly, the size of each flow line represents the number of individuals who responded with the two connected answers at either end of the flow line.

  • (As coded above) The color of each flow line represents the interest in ham, to make it easy to see the connection from the eggs side.
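
To back up the visual impression with a number, one option is a rank correlation computed from the same frequency table. This is a sketch under assumptions: the ratings are entered as plain numbers 1 to 5 rather than the labeled text used in the table, and Spearman’s coefficient is simply one reasonable choice for ordinal ratings, not something the chart itself computes.

```r
# Frequency table from earlier, with ratings as plain numbers (1-5).
breakfast <- data.frame(
  Eggs = c(5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2, 2, 1, 1),
  Ham  = c(1, 2, 1, 2, 3, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5),
  freq = c(3, 1, 1, 4, 5, 1, 1, 2, 3, 1, 1, 3, 2, 1, 2)
)

# Expand to one row per respondent by repeating each row "freq" times.
respondents <- breakfast[rep(seq_len(nrow(breakfast)), breakfast$freq), ]

# Spearman's rank correlation; a negative value means higher egg interest
# tends to go with lower ham interest.
rho <- cor(respondents$Eggs, respondents$Ham, method = "spearman")
print(rho)
```

A negative rho agrees with what the flow lines show; the closer it is to -1, the stronger the inverse relationship.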

Visualization Details

If you want to tweak the look of the chart and aren’t terribly familiar with R:

  • Eggs is on the left and Ham is on the right because of the aesthetic (aes) settings for ggplot as a whole. If you changed axis1 to Ham and axis2 to Eggs, the order on the chart would be swapped.

  • The width of the Eggs and Ham strata is set by width = 1/4 in geom_stratum. This can be changed to another fractional value (lower values are narrower, as you might expect). The size is relative to the overall chart size, so if you test on a small chart and export a big chart, you may end up with wider strata than you anticipated.

  • The colors are based on the interest in Ham, going red/orange/yellow/green/blue as interest rises. The color choices can be seen (and edited) in the scale_fill_manual and scale_color_manual lines. The colors follow Ham interest because of the aesthetic (aes) settings for geom_alluvium; you could change fill and color to Eggs to have the color correspond to interest in eggs instead.

  • Also on color, the slight transparency is due to alpha = 0.8 in the geom_alluvium call (a fixed setting, outside aes). This can be decreased to make flow lines more transparent (or increased, up to 1, to make them more opaque).
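
As an illustration of those tweaks combined, here is a hedged variation that puts Ham on the left, colors the flows by Eggs, narrows the strata, and makes the flows more transparent. It assumes a recent ggalluvial release (where the flow size aesthetic is y), and it uses only a few rows from the table, with numeric ratings, so that the block runs on its own.

```r
library(ggplot2)
library(ggalluvial)

# A few rows from the table above, just to keep the sketch self-contained.
breakfast <- data.frame(
  Eggs = factor(c(5, 4, 4, 2, 1)),
  Ham  = factor(c(1, 2, 3, 4, 5)),
  freq = c(3, 4, 5, 3, 2)
)

p <- ggplot(breakfast,
            aes(y = freq, axis1 = Ham, axis2 = Eggs)) +          # Ham on the left
  geom_alluvium(aes(fill = Eggs), width = 1/12, alpha = 0.5) +   # color by Eggs
  geom_stratum(width = 1/8, color = "grey") +                    # narrower strata
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Ham", "Eggs")) +
  theme_minimal()

print(p)
```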

Simplifying the chart by showing fewer responses

One easy expansion of this work is to limit how many of the responses to the first question are analyzed. The original alluvial diagram above is a little messy, so restricting the Eggs side to 4/5 and 5/5 can make the negative correlation easier to see, and you’ll still be able to make the case that “people who like eggs don’t like ham” with the new chart. A quick way to do this is to copy your CSV file and remove every row whose Eggs answer is not 4/5 or 5/5, but make sure to leave a 0-freq entry for each removed Eggs rating! If the Eggs column doesn’t include 1/5, 2/5, and 3/5, those ratings will be ordered incorrectly on the Ham side (see below).

Bad Example

Eggs                      Ham                       freq
5 (Quite interested)      1 (Not interested)        3
5 (Quite interested)      2 (Not very interested)   1
4 (Pretty interested)     1 (Not interested)        1
4 (Pretty interested)     2 (Not very interested)   4
4 (Pretty interested)     3 (Somewhat interested)   5
4 (Pretty interested)     5 (Quite interested)      1

This is a bad example; the Ham side is ordered illogically.

By adding 0-freq entries for 1/5, 2/5, and 3/5, you can avoid the weird behavior where the Ham side is ordered incorrectly. In the example below, we have 0-freq entries for the 1/5, 2/5, and 3/5 rankings on the Eggs side. Each one is paired with a Ham ranking of 1 (1-1, 2-1, 3-1) so that we don’t get any extra lines in the chart going from the top left over to the 2 and 3 on the right.

Good Example

Eggs                      Ham                       freq
1 (Not interested)        1 (Not interested)        0
2 (Not very interested)   1 (Not interested)        0
3 (Somewhat interested)   1 (Not interested)        0
5 (Quite interested)      1 (Not interested)        3
5 (Quite interested)      2 (Not very interested)   1
4 (Pretty interested)     1 (Not interested)        1
4 (Pretty interested)     2 (Not very interested)   4
4 (Pretty interested)     3 (Somewhat interested)   5
4 (Pretty interested)     5 (Quite interested)      1

This is a better example where both sides are ordered logically.
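
If you would rather not edit the CSV by hand, the same placeholder rows can be generated in R. The sketch below mirrors the manual approach described above rather than introducing a different fix; the variable names (levels_all, kept, placeholders, simplified) are made up for illustration.

```r
# Rating labels in display order, as used in the tables above.
levels_all <- c("1 (Not interested)", "2 (Not very interested)",
                "3 (Somewhat interested)", "4 (Pretty interested)",
                "5 (Quite interested)")

# The rows for Eggs ratings 4/5 and 5/5 (the "Bad Example" table).
kept <- data.frame(
  Eggs = levels_all[c(5, 5, 4, 4, 4, 4)],
  Ham  = levels_all[c(1, 2, 1, 2, 3, 5)],
  freq = c(3, 1, 1, 4, 5, 1)
)

# Add a 0-freq placeholder row for every Eggs rating that was filtered out,
# each pointing at the lowest Ham rating, and list the placeholders first.
missing_eggs <- setdiff(levels_all, kept$Eggs)
placeholders <- data.frame(Eggs = missing_eggs, Ham = levels_all[1], freq = 0)
simplified <- rbind(placeholders, kept)

print(simplified)
```

The resulting data frame matches the “Good Example” table row for row and can be passed to the same ggplot code in place of breakfastData.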

And with that, you’ve got a striking way to clearly show the correlation between survey results! This is a great companion to the typical bar and pie charts used to review survey responses.

Ways to expand this work

  • Another R library (or a different programming language’s libraries) might allow for a chart style that looks a bit better. For instance, it might be nice to put space between the ratings so that the center of each ranking group lines up with its counterpart on the other side.

  • ggalluvial supports a chart with several sets of strata, not just two as have been used here, so you could do a comparison between more than 2 survey questions if you have a larger data set. Just make sure it stays readable!

  • An R dev with more experience using Shiny or a similar web-focused system should be able to make charts that are interactive. I’d love to, when mousing over a specific flow line, be able to see the start and end points and how many people that flow line represents. Perhaps that will be covered in a future post!
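
As a rough sketch of the multi-question idea from the list above, a third axis can be added with axis3. Everything about the data here is hypothetical: “Toast” is an invented third question, the Low/High ratings are simplified, and the counts are made up purely to show the shape of the call. It also assumes a recent ggalluvial release.

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical data: "Toast" is an invented third question, and these
# counts are made up for illustration only.
three_way <- data.frame(
  Eggs  = factor(c("Low", "Low", "High", "High")),
  Ham   = factor(c("High", "High", "Low", "Low")),
  Toast = factor(c("Low", "High", "Low", "High")),
  freq  = c(4, 3, 5, 2)
)

# One axis per question; flows are still colored by the Ham answer.
p3 <- ggplot(three_way,
             aes(y = freq, axis1 = Eggs, axis2 = Ham, axis3 = Toast)) +
  geom_alluvium(aes(fill = Ham), width = 1/12, alpha = 0.8) +
  geom_stratum(width = 1/4, color = "grey") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Eggs", "Ham", "Toast")) +
  theme_minimal()

print(p3)
```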

Analytics, R

Evan Barale