John Fremlin's blog: Busily seeking significance in A-B tests
source link: http://john.freml.in/ab-testing-significance
Posted 2012-05-28 22:00:00 GMT
What is a coherent statistical test to determine whether a website variation is better than another?
Client-isolated software, like a website or many modern mobile apps, presents new opportunities for measuring statistics of the user experience and tailoring it to better match the user. The A-B test is a simple manifestation of this concept: by experimenting with superficial changes on a web page (variation A labelling a button Add to basket, say, against variation B labelling it Buy!), a 10-20% increase in the rate of people purchasing might be obtained.
This is very attractive because changing the text, colour or size of a button is easy. However, bigger changes to the user interface can boost rates by a few multiples instead of a few percent: for example, adding a new user interface element like showing the basket prominently on the page. A negative example might be removing all navigational and search elements from the page once something has been added to the basket, so that the user cannot be distracted from purchasing. These more significant changes should be tested with the same or even higher rigour than text variations, because their effects can be wide-ranging and complex.
Rather than just two alternatives, it's convenient to test many at the same time. It's easy to come up with text variations! With many variations it's very likely that one will appear better than the others. However, the same is true of lottery tickets or race horses. The question is: when do we decide that one variation is definitely better? A lot of voodoo statistics has been published about this, with mysterious Excel incantations. Here I attempt to lay out a clear and unambiguous methodology.
Firstly, to simplify issues with duplicate attempts: count the number of unique users exposed to each variation (not the number of impressions of each variation) and the number of unique users exposed to each variation that converted (e.g. added to basket, or purchased, or purchased and did not request a refund), not the number of purchases. This prevents the results being distorted by a minority of very prolific users. The distortion is generally reduced the longer term the conversion metric is.
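The per-user counting above can be sketched as follows. This is a minimal illustration assuming a hypothetical event log of (user_id, variation, converted) tuples, one per impression; the function and log format are not from the original article.

```python
from collections import defaultdict

def ab_counts(events):
    """Count unique users exposed and unique users converted per variation.

    `events` yields (user_id, variation, converted) tuples, one per
    impression; a user may appear many times but is counted at most once,
    so prolific users cannot distort the rates.
    """
    exposed = defaultdict(set)
    converted = defaultdict(set)
    for user_id, variation, did_convert in events:
        exposed[variation].add(user_id)
        if did_convert:
            converted[variation].add(user_id)
    return {v: (len(exposed[v]), len(converted[v])) for v in exposed}

# A user who converts three times on variation A still counts once:
log = [("u1", "A", False), ("u1", "A", True), ("u1", "A", True),
       ("u2", "A", False), ("u3", "B", True)]
counts = ab_counts(log)  # {"A": (2, 1), "B": (1, 1)}
```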
Now take the overall count of users exposed (n) and users converted (x) across all variations. This gives an average conversion rate p = x/n. For a given variation with m users exposed and y converted, consider whether a sample of size m taken randomly from the binomial distribution with probability p would have achieved y conversions. Now if P(Bin(m,p) ≥ y) is low then we can presume that this variation has some feature that causes it to be advantageous, beyond randomly selecting some subset of conversions. To compute this quantity in R, use 1 - pbinom(y - 1, m, p): pbinom(y - 1, m, p) is P(Bin(m,p) ≤ y − 1), so its complement is the upper tail including y itself. This test will show less significance than comparing the conversions and exposures of the set of people not exposed to the variation with those exposed, so it errs on the side of safety.
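The same upper-tail probability can be computed without R. Here is a small pure-Python sketch (not from the article) that sums the binomial mass below y in log space to avoid underflow for large m:

```python
import math

def binom_tail(y, m, p):
    """P(Bin(m, p) >= y): the chance that a variation shown to m users
    would reach y or more conversions if its true rate were the pooled
    rate p. Equivalent to R's 1 - pbinom(y - 1, m, p)."""
    cdf = 0.0
    for k in range(y):
        # log of the binomial pmf at k, via log-gamma to avoid overflow
        log_pmf = (math.lgamma(m + 1) - math.lgamma(k + 1)
                   - math.lgamma(m - k + 1)
                   + k * math.log(p) + (m - k) * math.log(1 - p))
        cdf += math.exp(log_pmf)
    return 1.0 - cdf

# Small sanity check: P(Bin(10, 0.5) >= 6) = 386/1024
print(binom_tail(6, 10, 0.5))  # 0.376953125
```

On the worked example below (y = 1100, m = 100 000, p = 0.01) this gives roughly 0.08%, i.e. about an eight-in-ten-thousand chance under the pooled rate.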
Once the test has determined that the one variation has a low probability of having the same conversion rate as the overall mix, the two rates should be modelled separately. Does this mean that one variation is better than another and should be deployed in the future? However great the improvement shown by the better variation this is not altogether clear. The variations may perform differently when the (typically latent) factors affecting their conversion rates naturally shift: for example, there may be novelty effects where a new variation performs well because it is intriguing, seasonal effects as for a Christmas centred message, or effects depending on the hunger for food of the interacting user.
Taking a concrete example: a shopping site chooses either an Amber or a Bronze background and directs m = 100 000 users to each variant. For the Amber variant 900 users convert and for the Bronze variant 1100 users convert, so p = 2000/200000 = 1%. For the Bronze variant, 1 - pbinom(1099, 100000, 0.01) ≈ 0.08%: if Bronze really converted at the overall 1% rate, there would be only a 0.08% chance of seeing 1100 or more conversions.
What, however, is driving the difference in the conversion rates? Suppose busy people like the Bronze background and convert at 11% on it but not at all on Amber, while apathetic people prefer Amber, converting at 1% on it and not at all on Bronze. Then all the difference in the observed conversion rates is driven by the proportion of apathetic people, in this case 90%. But come a busy day of the week or a holiday, who knows!
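The mixture arithmetic behind this example can be made explicit. A short sketch, using the segment rates assumed above (the 11%/1% figures and the busy/apathetic split are the article's illustration, not measured data):

```python
def blended_rate(busy_fraction, busy_rate, apathetic_rate):
    """Overall conversion rate of a population mixing two segments."""
    return busy_fraction * busy_rate + (1 - busy_fraction) * apathetic_rate

# 10% busy users: busy convert at 11% on Bronze and never on Amber;
# apathetic convert at 1% on Amber and never on Bronze.
bronze = blended_rate(0.10, 0.11, 0.0)   # 1.1%, as observed
amber = blended_rate(0.10, 0.0, 0.01)    # 0.9%, as observed

# If a holiday doubles the busy fraction, the measured gap shifts sharply
# even though no individual user changed their preference:
bronze_holiday = blended_rate(0.20, 0.11, 0.0)  # 2.2%
amber_holiday = blended_rate(0.20, 0.0, 0.01)   # 0.8%
```

The observed A-B difference is thus a function of the segment mixture, which the test itself does not measure and which can drift over time.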
Therefore one must be very circumspect about plumping for a variation: keep a holdout group and evaluate the results over time.