Common Mistakes During A/B Testing

source link: https://towardsdatascience.com/common-mistakes-during-a-b-testing-bdb9eefdc7f0

Use Mann-Whitney for comparing medians

The first mistake is inappropriate use of the Mann-Whitney test. The method is massively misunderstood and misused, because most people apply it as if it were a non-parametric "t-test" for medians. In reality, we can use the Mann-Whitney test only to check whether one distribution is shifted relative to the other.

In this plot we can see that X has the same distribution as Y, just with a shift. Image by Author

When we apply the Mann-Whitney test, we set our hypotheses as follows:

$$H_0: F_X(t) = F_Y(t) \text{ for all } t \qquad \text{(the null hypothesis)}$$

$$H_1: F_X(t) = F_Y(t - \delta) \text{ for some shift } \delta \neq 0 \qquad \text{(the alternative hypothesis)}$$

We should always take the assumptions into account; for this test there are just two:

  1. Our observations are independent and identically distributed (i.i.d.)
  2. Our distributions have the same shape

How to calculate the Mann-Whitney statistic:

  1. Arrange all the observations in order of magnitude
  2. Assign numeric ranks to all observations
  3. Calculate the U statistic for each group

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

where $R_i$ is the sum of all ranks for sample $i$ and $n_i$ is the number of observations in sample $i$.

4. Choose the minimum of these two values: $U = \min(U_1, U_2)$

5. Use statistical tables for the Mann-Whitney U test to find the probability of observing this value of U or lower.
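As a quick illustration, here is a minimal Python sketch of the test on two synthetic samples (assuming SciPy is available; the group names and distribution parameters are made up):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Two samples with the same shape, one shifted to the right
control = rng.normal(loc=0.0, scale=1.0, size=200)
treatment = rng.normal(loc=0.3, scale=1.0, size=200)

# Two-sided test: H0 says there is no shift between the distributions
u_stat, p_value = mannwhitneyu(control, treatment, alternative="two-sided")
print(f"U = {u_stat:.1f}, p-value = {p_value:.4f}")
```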

So we have figured out that we can't use the Mann-Whitney test for comparing medians; what should we use instead?

Fortunately for us, the statistician Frank Wilcoxon developed the signed-rank test in 1945. It is now officially called the Wilcoxon signed-rank test.

Our hypotheses for the test are what we expected at the beginning:

$$H_0: m_X = m_Y \qquad \text{(the null hypothesis, } m \text{ denotes the median)}$$

$$H_1: m_X \neq m_Y \qquad \text{(the alternative hypothesis)}$$

How to calculate the Wilcoxon signed-rank test statistic:

  1. For each pair of observations, compute the difference, and keep its absolute value and its sign
  2. Sort the absolute values from smallest to largest, and rank them accordingly.
  3. Finally, compute the test statistic:

$$W = \sum_{i=1}^{n} \operatorname{sgn}(x_i - y_i)\, R_i$$

where $R_i$ is the rank of $|x_i - y_i|$.

4. W has a known distribution. If n is greater than about 20, it is approximately normally distributed, so we can compute the probability of observing it under the null hypothesis and thereby obtain a significance level.
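For reference, under the null hypothesis this form of W has mean zero, and (assuming no ties) the standard normal approximation is:

$$z = \frac{W}{\sqrt{n(n+1)(2n+1)/6}}$$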

A bit of intuition behind the formula:

If the median difference is 0, then half of the signs should be positive and half negative, and the signs shouldn't be related to the ranks. If the median difference is nonzero, the absolute value of W will be large.
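Here is a minimal sketch of the paired test in Python (assuming SciPy; note that scipy.stats.wilcoxon defines its statistic slightly differently from the signed sum above, but the resulting p-value answers the same question):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)

# Paired observations, e.g. a metric before and after a change
before = rng.normal(loc=100.0, scale=10.0, size=50)
after = before + rng.normal(loc=2.0, scale=5.0, size=50)

# H0: the median of the pairwise differences is zero
w_stat, p_value = wilcoxon(before, after)
print(f"W = {w_stat:.1f}, p-value = {p_value:.4f}")
```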

Use bootstrapping for all cases and every dataset.

The second mistake is using bootstrapping all the time. I have seen many cases where people apply bootstrapping to every dataset without a few preliminary checks to make sure bootstrapping is appropriate in that case.

The key assumption for applying bootstrapping is:

The sample should represent the population from which it was drawn

If our data sample is biased and doesn't represent our population well, our bootstrapped statistic will be biased as well. That's why we should measure the proportions of different cohorts and segments.

If there are only women in our data sample , but in our whole customer database the genders are distributed equally , we can’t apply bootstrapping here.

A good practice is to compare all our main segments between the whole population and the dataset.
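As an illustration, here is a minimal sketch (the segment names, shares, and metric are all made up) that compares segment proportions in the sample against the full population before bootstrapping a statistic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical gender shares in the full customer base vs. the experiment sample
population_share = pd.Series({"female": 0.50, "male": 0.50})
sample = pd.Series(rng.choice(["female", "male"], size=1000, p=[0.7, 0.3]))
sample_share = sample.value_counts(normalize=True)

# Flag segments whose share deviates noticeably from the population
comparison = pd.DataFrame({"population": population_share, "sample": sample_share})
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()
print(comparison)

# Only if the sample looks representative does a bootstrap make sense:
values = rng.normal(loc=10.0, scale=3.0, size=1000)  # some metric per user
boot_medians = [np.median(rng.choice(values, size=values.size, replace=True))
                for _ in range(2000)]
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"Bootstrapped 95% CI for the median: [{ci_low:.2f}, {ci_high:.2f}]")
```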

Always use the default values for type I and II errors.

Last but not least is choosing the right parameters for an experiment. In 95% of cases, 95% of data analysts/scientists at 95% of companies use the default values: a 5% type I error rate and a 20% type II error rate (i.e., 80% power).

Why can't we just choose 0% for the type I error rate and 0% for the type II error rate?

Because that inevitably leads to an infinite number of samples to collect, and our experiment would last forever.

That's definitely not what we are looking for. That's why we have to compromise between the number of samples we need to collect and our error rates.
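To make this trade-off concrete, here is a minimal power-analysis sketch (assuming statsmodels is available; the effect size is hypothetical) showing how the required sample size grows as the error rates shrink:

```python
from statsmodels.stats.power import tt_ind_solve_power

effect_size = 0.1  # hypothetical standardized effect (Cohen's d)

# Tighter error rates demand dramatically more users per group
for alpha, power in [(0.05, 0.80), (0.01, 0.95), (0.001, 0.99)]:
    n = tt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power)
    print(f"alpha={alpha}, power={power}: ~{n:.0f} users per group")
```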

I encourage people to take into account all the specifics of their product. The most convenient way to do this is to create a table like the one below and discuss it with product managers and the people responsible for the product.

Typical table to make a decision. MDE — minimum detectable effect. Image by Author

For Netflix, even a 1% MDE can lead to significant profit, but for small startups that's not the case. For Google, it's an absolute breeze to involve even tens of millions of people in an experiment, so it's better to set the type I error rate to 0.1% and be more confident in the result.

