
John Fremlin's blog: Busily seeking significance in A-B tests

source link: http://john.freml.in/ab-testing-significance

Posted 2012-05-28 22:00:00 GMT

What is a coherent statistical test to determine whether a website variation is better than another?

Client-isolated software, like a website or many modern mobile apps, presents new opportunities for measuring statistics of the user experience and tailoring it to better match the user. The A-B test is a simple manifestation of this concept: by experimenting with superficial changes on a web page (variation A reading Add to basket against variation B reading Buy!, for example), a 10-20% increase in the rate of people purchasing might be obtained.

This is very attractive because changing the text or the colour or the size of a button is easy. However, bigger changes to the user interface can boost rates by a few multiples instead of a few percent: for example, adding a new user interface element such as showing the basket prominently on the page. A negative example might be removing any navigational or search elements from the page once something has been added to the basket, so that the user cannot be distracted from the purchase. These more significant changes should be tested with the same or even greater rigour than text variations, because their effects can be wide-ranging and complex.

Rather than just two alternatives, it's convenient to test many at the same time. It's easy to come up with text variations! With many variations it is very likely that one will appear better than the others; however, the same is true of lottery tickets or race horses. The question is: when do we decide that one variation is definitely better? A lot of voodoo statistics has been published about this, with mysterious Excel incantations. Here I attempt to lay out a clear and unambiguous methodology.

Firstly, to simplify issues with duplicate attempts: count the number of unique users exposed to each variation (not the number of impressions of each variation) and the number of unique users exposed to each variation who converted (e.g. added to basket, purchased, or purchased and did not request a refund), not the number of purchases. This prevents the results from being distorted by a minority of very prolific users. The longer-term the conversion metric, the smaller this distortion generally is.
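
As a minimal sketch of this counting rule in R, assuming a hypothetical event log with one row per impression and columns user_id, variation and converted (the names and figures are illustrative, not from any real site):

    # Hypothetical event log: one row per impression.
    events <- data.frame(
      user_id   = c(1, 1, 2, 3, 3, 3, 4),
      variation = c("A", "A", "A", "B", "B", "B", "B"),
      converted = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
    )

    # Unique users exposed to each variation, not impressions.
    exposed <- tapply(events$user_id, events$variation,
                      function(u) length(unique(u)))

    # Unique users who converted at least once, not total purchases.
    conversions <- events[events$converted, ]
    converted <- tapply(conversions$user_id, conversions$variation,
                        function(u) length(unique(u)))

    exposed     # A: 2 users exposed,  B: 2 users exposed
    converted   # A: 1 user converted, B: 1 user converted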

Now take the overall count of users exposed (n) and users converted (x) across all variations. This gives an average conversion rate p = x/n. For a given variation with m users exposed and y converted, ask whether a sample of size m taken randomly from the binomial distribution with probability p would plausibly have achieved y conversions. If P(Bin(m, p) ≥ y) is low then we can presume that this variation has some feature that makes it advantageous, beyond randomly selecting some subset of the conversions. To compute this quantity in R, use 1 - pbinom(y - 1, m, p). This test will show less significance than comparing the conversions and exposures of the set of people not exposed to the variation with those who were exposed, so it errs on the side of safety.
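
Wrapped up as a couple of lines of R (a sketch; the counts in the example call are made up):

    # Upper-tail probability that a variation with m users exposed and y
    # converted would do at least this well if its true rate were the
    # pooled rate p = x / n.
    variation_tail_prob <- function(y, m, x, n) {
      p <- x / n                  # pooled conversion rate across all variations
      1 - pbinom(y - 1, m, p)     # P(Bin(m, p) >= y)
    }

    # Made-up counts: 60 of 4000 users converted on this variation,
    # 200 of 16000 users converted overall.
    variation_tail_prob(y = 60, m = 4000, x = 200, n = 16000)

    # binom.test gives the same tail probability as an exact one-sided test,
    # plus a confidence interval for the variation's own rate:
    binom.test(60, 4000, p = 200 / 16000, alternative = "greater")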

Once the test has determined that a variation has a low probability of sharing the overall conversion rate, the two rates should be modelled separately. Does this mean that one variation is better than another and should be deployed in the future? However great the improvement shown by the better variation, this is not altogether clear. The variations may perform differently when the (typically latent) factors affecting their conversion rates naturally shift: for example, there may be novelty effects where a new variation performs well because it is intriguing, seasonal effects as for a Christmas-centred message, or effects depending on how hungry the interacting user happens to be.
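
One way to model the rates separately, as a sketch reusing the made-up counts from above: give each group its own interval estimate, and compare the users exposed to the variation directly against everyone else (the less conservative comparison mentioned earlier).

    # Separate interval estimates for the variation and for everyone else.
    binom.test(60, 4000)$conf.int      # the variation: 60 of its 4000 users converted
    binom.test(140, 12000)$conf.int    # the rest: 140 of the other 12000 converted

    # Direct two-sample comparison of exposed vs. not exposed.
    prop.test(c(60, 140), c(4000, 12000), alternative = "greater")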

Taking a concrete example: a shopping site chooses either an Amber or a Bronze background and directs m = 100 000 users to each variant. For the Amber variant 900 users convert and for the Bronze variant 1100 users convert, so the pooled rate is p = 2000/200 000 = 1%. The chance that the Bronze variant would convert 1100 or more of its 100 000 users if its true rate were this pooled 1% is 1 - pbinom(1099, 100000, 0.01), just under 0.1%, so we can reject the idea that Bronze merely converts at the overall rate.
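
The same numbers, worked through as a short R sketch:

    # Unique users exposed and converted for each variant.
    exposed   <- c(Amber = 100000, Bronze = 100000)
    converted <- c(Amber = 900, Bronze = 1100)

    # Pooled conversion rate: 2000 / 200000 = 1%.
    p <- sum(converted) / sum(exposed)

    # P(Bin(m, p) >= y) for each variant at the pooled rate:
    # near 1 for Amber, just under 0.1% for Bronze.
    1 - pbinom(converted - 1, exposed, p)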

What, however, is driving the difference in the conversion rates? Suppose Busy people like the Bronze background and convert at 11% on it but not at all on Amber, while Apathetic people prefer Amber, converting at 1% on it and not at all on Bronze. Then all of the difference in the conversion rates is driven by the proportion of people who are apathetic, in this case 90%: the Apathetic 90% give 90% × 1% = 0.9% on Amber, while the remaining 10% of Busy users give 10% × 11% = 1.1% on Bronze. But come a busy day of the week or a holiday, who knows!
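
The arithmetic behind that story, as a sketch (the segment shares and per-segment rates are the made-up ones above; the 5% share for a quieter day is also made up):

    # Busy users convert at 11% on Bronze and not at all on Amber;
    # Apathetic users convert at 1% on Amber and not at all on Bronze.
    mix_rates <- function(busy_share) {
      c(Amber  = (1 - busy_share) * 0.01,
        Bronze = busy_share * 0.11)
    }

    mix_rates(0.10)   # the observed mix: Amber 0.9%,  Bronze 1.1%  -> Bronze ahead
    mix_rates(0.05)   # a quieter day:    Amber 0.95%, Bronze 0.55% -> Amber ahead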

Therefore one must be very circumspect about plumping for a variation: keep a holdout group and evaluate the results over time.

