Common Mistakes During A/B Testing

source link: https://towardsdatascience.com/common-mistakes-during-a-b-testing-bdb9eefdc7f0

Use Mann-Whitney for comparing medians

The first mistake is inappropriate use of the Mann-Whitney test. The method is massively misunderstood and misused, because most people apply it as if it were a non-parametric "t-test" for medians. In reality, we can use the Mann-Whitney test only to check whether one distribution is shifted relative to the other.

In this plot we can see that X has the same distribution as Y, just with a shift. Image by Author

When we apply the Mann-Whitney test, we set our hypotheses as follows:

$$H_0: F_X(t) = F_Y(t) \text{ for all } t \qquad \text{(the null hypothesis)}$$

$$H_1: F_X(t) = F_Y(t - \delta) \text{ for some shift } \delta \neq 0 \qquad \text{(the alternative hypothesis)}$$

We should always take the assumptions into account; for this test there are just two:

  1. Our observations are independent and identically distributed (i.i.d.)
  2. Our distributions have the same shape

How to calculate the Mann-Whitney statistic:

  1. Arrange all the observations in order of magnitude
  2. Assign numeric ranks to all observations
  3. Calculate the U statistic for each group

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

where $R_i$ is the sum of all ranks for sample $i$ and $n_i$ is the number of observations in sample $i$.

4. Choose the minimum of these two values: $U = \min(U_1, U_2)$

5. Use statistical tables for the Mann-Whitney U test to find the probability of observing this value of U or lower.
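As a quick illustration, here is a minimal Python sketch of the test on two synthetic samples (assuming SciPy is available; the group names and distribution parameters are made up):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Two samples with the same shape, one shifted to the right
control = rng.normal(loc=0.0, scale=1.0, size=200)
treatment = rng.normal(loc=0.3, scale=1.0, size=200)

# Two-sided test: H0 says there is no shift between the distributions
u_stat, p_value = mannwhitneyu(control, treatment, alternative="two-sided")
print(f"U = {u_stat:.1f}, p-value = {p_value:.4f}")
```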

So we have figured out that we can't use the Mann-Whitney test for comparing medians; what should we use instead?

Fortunately for us, the statistician Frank Wilcoxon developed the signed-rank test in 1945. It is now officially called the Wilcoxon signed-rank test.

Our hypotheses for the test are what we expected at the beginning:

$$H_0: m_X = m_Y \qquad \text{(the null hypothesis, } m \text{ denotes the median)}$$

$$H_1: m_X \neq m_Y \qquad \text{(the alternative hypothesis)}$$

How to calculate the Wilcoxon signed-rank test statistic:

  1. For each pair of observations, compute the difference, and keep its absolute value and its sign
  2. Sort the absolute values from smallest to largest, and rank them accordingly.
  3. Finally, compute the test statistic:

$$W = \sum_{i=1}^{n} \operatorname{sgn}(x_i - y_i)\, R_i$$

where $R_i$ is the rank of $|x_i - y_i|$.

4. W has a known distribution. If n is greater than about 20, it is approximately normally distributed, so we can compute the probability of observing it under the null hypothesis and thereby obtain a significance level.
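For reference, under the null hypothesis this form of W has mean zero, and (assuming no ties) the standard normal approximation is:

$$z = \frac{W}{\sqrt{n(n+1)(2n+1)/6}}$$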

A bit of intuition behind the formula:

If the median difference is 0, then half of the signs should be positive and half negative, and the signs shouldn't be related to the ranks. If the median difference is nonzero, the absolute value of W will be large.
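Here is a minimal sketch of the paired test in Python (assuming SciPy; note that scipy.stats.wilcoxon defines its statistic slightly differently from the signed sum above, but the resulting p-value answers the same question):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)

# Paired observations, e.g. a metric before and after a change
before = rng.normal(loc=100.0, scale=10.0, size=50)
after = before + rng.normal(loc=2.0, scale=5.0, size=50)

# H0: the median of the pairwise differences is zero
w_stat, p_value = wilcoxon(before, after)
print(f"W = {w_stat:.1f}, p-value = {p_value:.4f}")
```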

Use bootstrapping for all cases and every dataset.

The second mistake is using bootstrapping all the time. I have seen many cases where people apply bootstrapping to every dataset without a few preliminary checks to make sure bootstrapping is appropriate in that case.

The key assumption for applying bootstrapping is:

The sample should represent the population from which it was drawn

If our data sample is biased and doesn't represent our population well, our bootstrapped statistic will be biased as well. That's why we should measure the proportions of different cohorts and segments.

If there are only women in our data sample , but in our whole customer database the genders are distributed equally , we can’t apply bootstrapping here.

A good practice is to compare all our main segments between the whole population and the dataset.
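As an illustration, here is a minimal sketch (the segment names, shares, and metric are all made up) that compares segment proportions in the sample against the full population before bootstrapping a statistic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical gender shares in the full customer base vs. the experiment sample
population_share = pd.Series({"female": 0.50, "male": 0.50})
sample = pd.Series(rng.choice(["female", "male"], size=1000, p=[0.7, 0.3]))
sample_share = sample.value_counts(normalize=True)

# Flag segments whose share deviates noticeably from the population
comparison = pd.DataFrame({"population": population_share, "sample": sample_share})
comparison["abs_diff"] = (comparison["population"] - comparison["sample"]).abs()
print(comparison)

# Only if the sample looks representative does a bootstrap make sense:
values = rng.normal(loc=10.0, scale=3.0, size=1000)  # some metric per user
boot_medians = [np.median(rng.choice(values, size=values.size, replace=True))
                for _ in range(2000)]
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"Bootstrapped 95% CI for the median: [{ci_low:.2f}, {ci_high:.2f}]")
```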

Always use the default values for type I and II errors.

Last but not least is choosing the right parameters for an experiment. In 95% of cases, 95% of data analysts/scientists at 95% of companies use the default values: a 5% type I error rate and a 20% type II error rate (i.e., 80% power).

Why can't we just choose 0% for the type I error rate and 0% for the type II error rate?

Because that inevitably leads to an infinite number of samples to collect, and our experiment would last forever.

That's definitely not what we are looking for. That's why we have to compromise between the number of samples we need to collect and our error rates.
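To make this trade-off concrete, here is a minimal power-analysis sketch (assuming statsmodels is available; the effect size is hypothetical) showing how the required sample size grows as the error rates shrink:

```python
from statsmodels.stats.power import tt_ind_solve_power

effect_size = 0.1  # hypothetical standardized effect (Cohen's d)

# Tighter error rates demand dramatically more users per group
for alpha, power in [(0.05, 0.80), (0.01, 0.95), (0.001, 0.99)]:
    n = tt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power)
    print(f"alpha={alpha}, power={power}: ~{n:.0f} users per group")
```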

I encourage people to take into account all the specifics of their product. The most convenient way to do this is to create a table like the one below and discuss it with product managers and the people responsible for the product.

Typical table to make a decision. MDE — minimum detectable effect. Image by Author

For Netflix, even a 1% MDE can lead to significant profit, but for small startups that's not the case. For Google, it's an absolute breeze to involve even tens of millions of people in an experiment, so it's better to set the type I error rate to 0.1% and be more confident in the result.

