
Quantifying the User Experience without statistics

Say goodbye to statistical tests and use the Bootstrap instead

by Florent Buisson

Photo by Will Jackson (Flickr)

UX researchers are increasingly asked to quantify the results of their research. Measuring improvements in business outcomes is also a great way to justify the value of research when facing business partners who would rather skip it. However, when your background is in qualitative research and your last exposure to statistics is that one class you took in grad school, statistics can seem intimidating, if not downright sorcery: “if your sample size is less than 20, your target variable is binary and it’s a full moon, then add 2 successes, 2 failures, 3 drops of salamander blood and calculate the adjusted-Wald metric while stirring counterclockwise”. Well, I’m here to tell you that it doesn’t have to be that way.

Statistics were developed long before the advent of computers, when calculations had to be done painstakingly by hand, and for the purpose of scientific research. Nowadays, if your goal is to drive effective decision-making (as is the case in UX research), the Bootstrap method offers a much simpler but remarkably robust alternative. In the rest of this post, I’ll introduce you to this method and show what it looks like in a few UX examples, using code in R and Python.

Example 1: confidence interval for a success rate

Let’s start with a concrete example from Sauro & Lewis [1]: we gave 10 users a task, and 7 of them were able to complete it. How certain are we of our success rate?

1. Traditional approach. In statistics, we rely on statistical distributions, typically the binomial or the normal distribution, to estimate confidence intervals (CIs). With the statistical approach, the corresponding CIs are [42%; 98%] using the Wald method, [35%; 93%] using the exact method, and [39%; 90%] using the (preferred) adjusted-Wald method.

2. Bootstrap approach. With the Bootstrap, instead of making a statistical assumption about the population from which our sample is taken, we simply assume that our population is made up of a practically infinite number of copies of our sample. In the present case, imagine for instance a population of 10 million users, of which 7 million were able to complete the task. We then draw from this population a large number of new samples, called Bootstrap samples, of the same size as our original sample and observe their distribution. Thus, a Bootstrap sample may have 7 successes and 3 failures like the original sample, but it may also be 6-4, 8-2, etc.

In practice, simulating 10 million users would be tedious and it’s unnecessary. Instead, we draw “with replacement” from our observed values. Picture now a small urn containing 7 black balls (the successes) and 3 white balls (the failures). Drawing with replacement means that we draw a ball at random from the urn, observe its color, then place it back into the urn before drawing again, and so on until we have observed the same number of balls (10) as our original sample. By putting a ball back into the urn after observing it, we ensure that the probability of observing a certain value remains constant, as if we were drawing without replacement from an entire population.

Drawing with replacement is very simple in R or Python:

## R
library(dplyr)  # for slice_sample()
boot_dat <- slice_sample(dat, n = nrow(dat), replace = TRUE)

## Python
boot_df = data_df.sample(len(data_df), replace=True)

Let’s generate many such bootstrapped samples, typically a few hundred to a few thousand, and calculate the percentage of success in each case. For example, the following graph shows a histogram for 200 bootstrapped samples.
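If you’re curious what this looks like end to end, here is a minimal Python sketch (variable names are mine) that generates the 200 simulated success rates:

## Python
import numpy as np

rng = np.random.default_rng(1)  # seeded for reproducibility
outcomes = np.array([1]*7 + [0]*3)  # 7 successes, 3 failures

# 200 Bootstrap samples: draw 10 values with replacement, take the mean
boot_rates = [rng.choice(outcomes, size=len(outcomes), replace=True).mean()
              for _ in range(200)]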

Fig.1: Bootstrap samples — count by percentage of success (image by author)

As you can see in Figure 1, almost 60 of our 200 simulated samples have a 70% success rate, but there is some significant variation. At the lower end of the spectrum, 2 of them had only a 30% success rate, whereas at the upper end, 10 of them had a 100% success rate.

To form our Bootstrap confidence interval, we simply discard the simulated samples with the most extreme values. For a 95%-CI, this means discarding 200*(1–0.95)=10 values, 5 from each side. The lowest and highest remaining values are the bounds of our CI.
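In code, discarding the extremes is just a sort and a slice; a quick sketch, reusing the boot_rates list from the earlier snippet:

## Python
import numpy as np

sorted_rates = np.sort(boot_rates)
LL_b, UL_b = sorted_rates[5], sorted_rates[-6]  # drop 5 values from each side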

Fig.2: removing the most extreme values (image by author)

In this case, the 95%-CI is [40%; 100%], pretty close to the traditional CIs from statistical methods. Of course, instead of removing the most extreme values by hand, there is a simpler solution if you’re familiar with quantiles: we could have directly taken the 2.5% and 97.5% quantiles of our simulated samples.

## R
LL_b <- quantile(boot_summ$mean, c(0.025))
UL_b <- quantile(boot_summ$mean, c(0.975))
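In Python, assuming the simulated success rates are stored in the boot_rates list from the earlier sketch, numpy’s quantile function does the same job:

## Python
import numpy as np

LL_b = np.quantile(boot_rates, 0.025)  # 2.5% quantile = lower bound
UL_b = np.quantile(boot_rates, 0.975)  # 97.5% quantile = upper bound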

Example 2: confidence interval for a completion time

Let’s look at a second example, again from Sauro & Lewis [1], where we measure the completion time for a task. The 10 observed values are: 94, 95, 96, 113, 121, 132, 190, 193, 255, 298. One difficulty with completion time data is that it tends to be positively skewed and cannot take negative values, which makes the statistical approach more complex. Instead, let’s generate 200 simulated Bootstrap samples and see what their distribution looks like.

Fig.3: Bootstrap samples — count by average completion time (image by author)

This histogram is very irregular, which means that our number of Bootstrap samples is too small relative to the variability of our data. We can easily correct that by generating 2,000 samples instead.

Fig.4: Bootstrap samples — count by average completion time (image by author)

This looks smoother. We now need to remove 2,000*(1–0.95)=100 values, 50 from each side.

Fig.5: Bootstrap samples — count by average completion time (image by author)

The corresponding 95%-CI is [119; 202]. This is again pretty close to the one obtained by Sauro & Lewis, namely [108; 198], and we don’t have to go through the log-transformation they apply to the data.
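For reference, here is what this whole example could look like in Python; a minimal sketch (variable names are mine) using the same 2,000-sample setup:

## Python
import numpy as np

rng = np.random.default_rng(1)  # seeded for reproducibility
times = np.array([94, 95, 96, 113, 121, 132, 190, 193, 255, 298])

# 2,000 Bootstrap samples of the mean completion time
boot_means = [rng.choice(times, size=len(times), replace=True).mean()
              for _ in range(2000)]

# 95%-CI: the 2.5% and 97.5% quantiles of the bootstrapped means
LL_b, UL_b = np.quantile(boot_means, [0.025, 0.975])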

The Bootstrap approach is simpler

The greatest advantage of the Bootstrap approach over the traditional statistical approach is that it is much simpler. With the traditional approach, you may need to consider a bewildering variety of factors such as:

  • the sample size (“should I use the t-distribution or the normal distribution with 25 subjects?”) and its dreaded relative, the number of degrees of freedom;
  • the metric you’re using (“how do I calculate a CI for the median instead of the mean?”);
  • the skewness of your data (“how do I calculate a CI for a completion time, which can only be positive?”);
  • the goal of your study (“am I trying to estimate the variability of data in one group, to compare a group to a benchmark, or to compare two groups to each other?”).

And once you have picked a test, you have no way of knowing if you got the right one or not. In my opinion, this feels a bit too much like a Green Day song: “make the best of this test, and don’t ask why”.

On the contrary, with the Bootstrap approach, there are only two things you need to understand and become familiar with:

  1. Simulating samples by drawing with replacement,
  2. Calculating the bounds of your CI by “pruning” the most extreme values.

That’s it! Then you can apply the approach to calculate any confidence interval whatsoever. The only limitation of the Bootstrap is that it can take a while to run a simulation with large data (think more than tens or hundreds of thousands of rows), but you can deal with it by letting your program run overnight.

Finally, let’s see what the Bootstrap looks like with a more complex example, where business partners have requested a p-value.

Example 3: P-value for the comparison of two groups

In my (hopefully informed) opinion, p-values are misleading and obsolete. Their only benefit in an applied setting like UX is to remind our business partners that an observed value may be due to chance and not representative, but this benefit is more than offset by the confusion p-values introduce (I elaborated on this in a previous Medium post [3]). However, by now many business partners are used to seeing them, so it can be useful to know how to calculate a similar metric with the Bootstrap.

We have the following data comparing the SUS scores of two CRM applications for different groups (Sauro & Lewis [1]):

SUS scores, image by author based on Sauro & Lewis

When we have independent groups, we must draw with replacement separately for each group. One way of thinking about it is that we’re drawing at the level of a behavioral unit: independent individuals are separate units. On the other hand, for dependent groups such as before-after comparisons for the same individuals, we would draw at the individual level across both groups. Let’s generate 2,000 Bootstrap samples to get a smooth graph:
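In code, drawing separately for each group just means resampling each group on its own before taking the difference in means. Here is a minimal Python sketch; since the score table above is an image, the two arrays below are illustrative placeholders, not the actual data:

## Python
import numpy as np

rng = np.random.default_rng(1)
scores_a = np.array([82, 75, 90, 70, 85, 78, 88, 72])  # placeholder SUS scores, app A
scores_b = np.array([70, 73, 88, 65, 80, 74, 79, 68])  # placeholder SUS scores, app B

# Resample each group independently, then take the difference in mean scores
boot_diffs = [rng.choice(scores_a, size=len(scores_a), replace=True).mean()
              - rng.choice(scores_b, size=len(scores_b), replace=True).mean()
              for _ in range(2000)]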

Fig.6: Bootstrap samples — count by difference in average scores between app A and app B (image by author)

As we can see in Figure 6, application A tends to have a higher average SUS score than application B — the differences are mostly above zero, but there are a non-negligible number of values below zero. This is confirmed by the 95%-CI being [-1.08, +5.13] (Sauro & Lewis: [-1.8; +5.8]). While the Bootstrap approach doesn’t lend itself to p-values, it offers a comparable metric, the Achieved Significance Level (ASL). Ironically, the interpretation of the ASL is close to the one people often wrongly assign to the p-value: the probability that the true value is zero or of the “wrong” sign. As with the p-value, the ASL is calculated as the percentage such that the corresponding CI has zero as one of its bounds.
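Under that interpretation, one simple way to compute the ASL is to take the share of simulated differences that fall at or below zero; a quick sketch, reusing the boot_diffs list from the previous snippet:

## Python
import numpy as np

# Share of Bootstrap differences that are zero or negative
asl = np.mean(np.array(boot_diffs) <= 0)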

Fig.7: Bootstrap samples — count by difference in average scores between app A and app B (image by author)

Here, the ASL is 0.205, or 20.5%. This value is lower than Sauro & Lewis’s 28.35%, possibly because the data is a bit skewed, which a t-test wouldn’t account for. Thus, if our business partners ask us if the difference is “statistically significant”, we can tell them that “the ASL is 20%, so there’s a 20% chance that the true value is zero or less” (if you want to be pedantic, that’s still technically not true, as you would need a full Bayesian model to estimate that probability, but it will lead you astray less often than p-values).

Conclusion

Now that even a basic laptop can run hundreds of simulations in a few seconds, the Bootstrap approach offers an attractive alternative to the traditional statistical approach to quantify UX outcomes. It allows you to calculate confidence intervals and achieved significance levels in all circumstances, regardless of the size or shape of your data. When the assumptions of traditional statistical tests are fulfilled, the Bootstrap generally yields very similar results; but when they’re not, it is a much better guide.

But wait, there’s more!

Here comes the final shameless plug. If you want to learn more about the Bootstrap, my book [2] will show you:

  • How to determine the number of Bootstrap samples to draw (and why it’s okay to change that number as you see fit);
  • How to apply the Bootstrap to regression, A/B tests and ad-hoc statistics;
  • How to use advanced versions of the Bootstrap that are even more accurate;
  • And a lot of other cool things about analyzing customer and user behavior data.
