
Use a Negative Binomial for Count Data

source link: https://towardsdatascience.com/use-a-negative-binomial-for-count-data-c68c062de203?gi=386236a4fa34

Up your Statistics Game

Aug 2 · 8 min read

The Negative Binomial distribution is a discrete probability distribution that you should have in your toolkit for count data. For example, you might have data on the number of pages someone visited before making a purchase or the number of complaints or escalations associated with each customer service representative. Given this data, you might want to model the process and, later, see if some covariates affect the parameters. And in many contexts, you might find that a negative binomial distribution is a good fit.

In this article we’ll introduce the distribution and derive its probability mass function (PMF). We’ll obtain its basic properties (mean and variance) using only the binomial theorem. This is in contrast to the usual treatments, which either just hand you a formula or use heavier tools to derive the results. Finally, we’ll turn to the distribution’s interpretations.

The Negative Binomial Distribution

Suppose you are going to flip a biased coin that has probability p of coming up heads, which we will call a “success.” Furthermore, you are going to flip the coin repeatedly until r successes occur. Let k be the number of failures along the way (so k+r coin flips happen in total).

In the context of our examples, we could imagine:

  • A user might browse your website. On each page they have a probability p = 1% of seeing an item they want to buy. We imagine that when they have put r = 3 items in their basket, they are ready to check out. k is the number of pages they browse without buying. Of course, we will want to fit the model to find the true values of r and p, as well as if/how they vary between users.
  • A customer service representative receives complaints. On each complaint, there is a probability p that they will be reprimanded. After being told off r times, they change their behavior and stop getting complaints. k is the number of complaints for which they are not reprimanded before they change their behavior.

Whether you actually think this is true is, as always, up to your prior beliefs and how well the model fits the data. Also, note that the number of failures is closely related to the total number of events (k versus k + r).
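The generative story above is easy to simulate directly. A minimal sketch in Python, with arbitrary illustrative values r = 3 and p = 0.25:

```python
import random

def failures_before_r_successes(r, p, rng):
    """Simulate the story directly: flip a p-biased coin until the
    r-th success and return k, the number of failures along the way."""
    successes = failures = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

rng = random.Random(0)
r, p = 3, 0.25
samples = [failures_before_r_successes(r, p, rng) for _ in range(100_000)]

# Sanity check straight from the story: k = 0 means r successes in a row,
# so the empirical frequency of 0 should be near p**r = 0.015625.
frac_zero = samples.count(0) / len(samples)
print(frac_zero)
```

Each call is one "customer journey": a stream of Bernoulli trials stopped at the r-th success.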

It is relatively straightforward to write down the probability mass function using some combinatorics. The probability that the r-th success happens on the (k+r)-th coin flip is:

  1. The probability that there are r–1 successes on the first k+r–1 flips, times
  2. The probability of success on the (k+r)-th flip.

There are (k+r–1) choose k orderings of r–1 successes and k failures on the first k+r–1 flips (the number of ways to arrange k A’s and r–1 B’s in a line). Each ordering has the same probability of occurring. This gives the PMF:

P(X = k) = \binom{k+r-1}{k} p^r (1-p)^k

Hopefully you remember some basic facts about combinations and permutations. If not, here is a brief review of facts you can convince yourself of to help you out. Suppose there are 3 A’s and 2 B’s and you want to arrange them into a string like “AAABB” or “ABABA”. The number of ways to do this is 5 choose 2 (there are 5 total things and 2 B’s), which is the same as 5 choose 3 (there are 3 A’s). To see this, pretend that each letter is actually a distinct symbol (so the 5 symbols are A1, A2, A3, B1, B2). Then there are 5! = 120 ways to arrange the distinct symbols. But there are 3! = 6 ways to rearrange A1, A2, A3 without changing the placement of the A’s, and 2! = 2 ways to rearrange the B’s. So the total number is 5!/(3!·2!) = 120/12 = 10.
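The counting argument in this review can be checked by brute force:

```python
from itertools import permutations
from math import comb, factorial

# All distinct arrangements of three A's and two B's, by brute force:
# enumerate every permutation of the multiset and deduplicate with a set.
arrangements = set(permutations("AAABB"))
print(len(arrangements))

# Agrees with 5!/(3! 2!) and with both binomial coefficients.
assert len(arrangements) == factorial(5) // (factorial(3) * factorial(2))
assert len(arrangements) == comb(5, 2) == comb(5, 3)
```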

Now, the trick is that binomial coefficients also make sense with a negative number, or a non-integer, on top. For example, if we expand the coefficient above, we can pull a minus sign out of each of the k factors in the numerator:

\binom{k+r-1}{k} = \frac{(k+r-1)(k+r-2)\cdots r}{k!} = (-1)^k \, \frac{(-r)(-r-1)\cdots(-r-k+1)}{k!} = (-1)^k \binom{-r}{k}

The Negative Binomial Distribution as an actual Negative Binomial

Hence the name “negative binomial.”

The other trick to keep in mind is that we can define binomial coefficients with non-integer arguments, using the fact that the Γ function (Gamma function) satisfies, for positive integers n,

\Gamma(n) = (n-1)!
The Gamma Function extends the Factorial

We can write our binomial coefficients in the form

\binom{k+r-1}{k} = \frac{(k+r-1)!}{k!\,(r-1)!} = \frac{\Gamma(k+r)}{\Gamma(k+1)\,\Gamma(r)}
Binomial Coefficients with n not an integer

This lets us drop the requirement that the parameter r of the negative binomial distribution be an integer. That will be useful because, when we estimate our models, we generally have no way to constrain r to integer values, so a non-integer estimate for r won’t be a problem. (We will require r to be positive, however.) We’ll come back to how to interpret a non-integer value of r.
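A quick sketch of the Gamma-based coefficient, using only the standard library (the function name `gbinom` is my own):

```python
from math import comb, gamma

def gbinom(k, r):
    """C(k+r-1, k) computed via the Gamma function;
    valid for any real r > 0, not just integers."""
    return gamma(k + r) / (gamma(k + 1) * gamma(r))

# Matches the ordinary binomial coefficient when r is a positive integer...
assert abs(gbinom(4, 3) - comb(6, 4)) < 1e-9
# ...and is perfectly well defined for non-integer r,
# e.g. C(3.5, 2) = 3.5 * 2.5 / 2! = 4.375.
print(gbinom(2, 2.5))
```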

Properties of the Negative Binomial Distribution

We would like to compute the expectation and variance. As a warmup, let’s check that the negative binomial distribution is in fact a probability distribution. For convenience, let q=1–p .

\begin{aligned}
\sum_{k=0}^{\infty} \binom{k+r-1}{k} p^r q^k
&= p^r \sum_{k=0}^{\infty} (-1)^k \binom{-r}{k} q^k \\
&= p^r \sum_{k=0}^{\infty} \binom{-r}{k} (-q)^k \\
&= p^r (1 - q)^{-r} \\
&= p^r p^{-r} = 1
\end{aligned}

The Negative Binomial Distribution is in fact a Probability Distribution

The crucial point is the third line, where we used the binomial theorem (yes, it works with negative exponents).
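We can also confirm the normalization numerically by summing the PMF over a long range of k (the values r = 3, p = 0.25 are arbitrary; the tail beyond k = 2000 is negligible):

```python
from math import comb

def nb_pmf(k, r, p):
    """P(X = k) = C(k+r-1, k) * p**r * (1-p)**k, for integer r."""
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

r, p = 3, 0.25
total = sum(nb_pmf(k, r, p) for k in range(2000))
print(total)
```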

Now let’s compute the expectation:

\begin{aligned}
E[X] &= \sum_{k=0}^{\infty} k \binom{k+r-1}{k} p^r q^k \\
&= \sum_{k=1}^{\infty} k \binom{k+r-1}{k} p^r q^k \\
&= r \sum_{k=1}^{\infty} \binom{k+r-1}{k-1} p^r q^k \\
&= r q \sum_{j=0}^{\infty} \binom{j+r}{j} p^r q^j \\
&= r q \, p^r (1-q)^{-(r+1)} \\
&= \frac{r q \, p^r}{p^{r+1}} \\
&= \frac{r q}{p}
\end{aligned}

Expected Value of the Negative Binomial Distribution

To get the third line, we used the identity

k \binom{k+r-1}{k} = r \binom{k+r-1}{k-1}

We used the binomial theorem again to get the third-to-last line.

Warning: this is the opposite of what you will find on Wikipedia as of this writing, though it matches Wolfram (the makers of Mathematica). The discrepancy arises because Wikipedia counts the number of successes before r failures, whereas we count failures before r successes. In general, there are several similar ways to parameterize/interpret the distribution, so be careful that you have everything straight when comparing formulas from different sources.

Next, we can compute the variance in two steps. First, we repeat the trick from above, using the identity twice this time to get the third line. We again use the binomial theorem to compute the sum and obtain the third-to-last line.

\begin{aligned}
E[X(X-1)] &= \sum_{k=0}^{\infty} k(k-1) \binom{k+r-1}{k} p^r q^k \\
&= r \sum_{k=2}^{\infty} (k-1) \binom{k+r-1}{k-1} p^r q^k \\
&= r(r+1) \sum_{k=2}^{\infty} \binom{k+r-1}{k-2} p^r q^k \\
&= r(r+1) \, q^2 \sum_{j=0}^{\infty} \binom{j+r+1}{j} p^r q^j \\
&= r(r+1) \, q^2 \, p^r (1-q)^{-(r+2)} \\
&= \frac{r(r+1) \, q^2 \, p^r}{p^{r+2}} \\
&= \frac{r(r+1) \, q^2}{p^2}
\end{aligned}

Now we can compute:

\begin{aligned}
\mathrm{Var}(X) &= E[X(X-1)] + E[X] - E[X]^2 \\
&= \frac{r(r+1) q^2}{p^2} + \frac{r q}{p} - \frac{r^2 q^2}{p^2} \\
&= \frac{r q^2 + r p q}{p^2} \\
&= \frac{r q (q + p)}{p^2} \\
&= \frac{r q}{p^2}
\end{aligned}

Variance of the Negative Binomial Distribution

Again, this uses the opposite convention from Wikipedia.
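Both moment formulas can be checked numerically against the PMF (again with arbitrary r = 3, p = 0.25, so q = 0.75, rq/p = 9 and rq/p² = 36):

```python
from math import comb

r, p = 3, 0.25
q = 1 - p

def nb_pmf(k):
    return comb(k + r - 1, k) * p ** r * q ** k

# First and second moments by direct summation; the tail past
# k = 2000 contributes nothing at double precision.
mean = sum(k * nb_pmf(k) for k in range(2000))
var = sum(k * k * nb_pmf(k) for k in range(2000)) - mean ** 2
print(mean, var)
```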

Interpretation of the Negative Binomial Distribution

We have covered the “defining interpretation” of the Negative Binomial Distribution: it is the number of failures before r successes occur, with the probability of success at each step being p. But there are a few other ways to look at the distribution that can be illuminating and also help interpret the case where r is not an integer.

Over-Dispersed Poisson Distribution

The Poisson distribution is a very simple model for count data, which assumes that events happen randomly at a certain rate. Then it models the distribution of how many events will occur in a given time interval. In the context of our examples, it would say that:

  • Customer service representatives get complaints at a constant rate, and the variation in counts is just random. (Compare this with the earlier model, where their behavior eventually changes.) Again, we could model differences in rate between representatives using exogenous covariates.

One big problem with the Poisson distribution is that the variance is equal to the mean. This may not fit our data. Let’s say we parameterize our Negative Binomial distribution with a mean λ and stopping parameter r . Then we have

\lambda = \frac{r(1-p)}{p} \quad\Longleftrightarrow\quad p = \frac{r}{r+\lambda}, \qquad 1-p = \frac{\lambda}{r+\lambda}

Re-parametrization of the Negative Binomial Distribution

Our probability mass function becomes

P(X = k) = \binom{k+r-1}{k} \left(\frac{r}{r+\lambda}\right)^{\!r} \left(\frac{\lambda}{r+\lambda}\right)^{\!k}

Probability Mass Function for the Negative Binomial parameterized with Mean λ

Now let’s consider what happens if we take the limit as r → ∞ holding λ fixed. (This means the probability of success p = r/(λ+r) goes to 1 as well.) In this limit, the binomial coefficient approaches r^k divided by k!, and r + λ approaches r:

\begin{aligned}
P(X = k) &= \binom{k+r-1}{k} \left(\frac{r}{r+\lambda}\right)^{\!r} \left(\frac{\lambda}{r+\lambda}\right)^{\!k} \\
&\approx \frac{r^k}{k!} \left(1+\frac{\lambda}{r}\right)^{\!-r} \frac{\lambda^k}{r^k} \\
&\xrightarrow[\;r \to \infty\;]{} \frac{\lambda^k e^{-\lambda}}{k!}
\end{aligned}

Limit of the Negative Binomial for Large r with fixed mean λ

In the last line, the factors of r^k cancel and we have used the definition of the exponential. The result is that we recover the Poisson distribution.

Therefore, we can interpret the Negative Binomial Distribution as a generalization of the Poisson distribution. If the data are in fact Poisson, we will estimate a large r with p close to 1. This makes sense because as p approaches 1, the variance approaches the mean. When p is smaller than 1, the variance is higher than that of a Poisson distribution with the same mean, so we can see that the Negative Binomial distribution generalizes the Poisson by increasing the variance.
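A numeric sketch of this limit: holding λ fixed and growing r, the Negative Binomial PMF approaches the Poisson PMF (integer r values only here, so `math.comb` suffices; λ = 4 is arbitrary):

```python
from math import comb, exp, factorial

lam = 4.0

def nb_pmf(k, r):
    """NB PMF parameterized by integer stopping parameter r and mean lam."""
    p = r / (r + lam)
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

def poisson_pmf(k):
    return lam ** k * exp(-lam) / factorial(k)

# Largest pointwise gap between the two PMFs shrinks as r grows.
diffs = []
for r in (5, 50, 5000):
    diff = max(abs(nb_pmf(k, r) - poisson_pmf(k)) for k in range(40))
    diffs.append(diff)
    print(r, diff)
```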

Mixture of Poisson Distributions

The Negative Binomial Distribution also arises as a mixture of Poisson random variables. For example, suppose that our customer service representatives each receive complaints at a given rate (they never change their behavior), but that rate varies between representatives. If that rate is randomly distributed according to a Gamma distribution, we get a Negative Binomial Distribution for the ensemble.

The intuition behind this is as follows. We initially described the Negative Binomial Distribution as the count of failures before r successes in a sequence of coin flips. Instead, replace the coin flips with two Poisson processes: a “success” process with rate p and a “failure” process with rate 1–p. Rather than counting coin flips, we imagine independent processes generating successes and failures, and we count how many failures occur before a given number of successes.

Now, the Gamma Distribution describes waiting times in a Poisson process. Let T be the waiting time for r successes from the “success” process; T is Gamma distributed. Conditional on T, the number of failures in that window is Poisson distributed with mean (1–p)T.
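The Gamma–Poisson mixture can be simulated directly. A sketch, assuming the mixing distribution is Gamma with shape r and scale (1–p)/p (the standard choice that recovers the Negative Binomial); with r = 3 and p = 0.25 the mixture should have mean rq/p = 9 and variance rq/p² = 36:

```python
import random
from math import exp

rng = random.Random(42)
r, p = 3, 0.25
scale = (1 - p) / p   # Gamma scale chosen so the mixture mean is r(1-p)/p

def poisson_sample(lam, rng):
    """Knuth's multiplication method; fine for the moderate rates drawn here."""
    target, k, prod = exp(-lam), 0, rng.random()
    while prod > target:
        k += 1
        prod *= rng.random()
    return k

draws = []
for _ in range(50_000):
    lam = rng.gammavariate(r, scale)   # shape r, scale (1-p)/p
    draws.append(poisson_sample(lam, rng))

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(mean, var)
```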

Conclusion

A few final points are worth making. First, there is no closed-form way to fit the Negative Binomial Distribution to data. Instead, use maximum likelihood estimation with numerical optimization. You can use the statsmodels package to do this in Python.
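As a sketch of the numerical MLE (using scipy rather than statsmodels for brevity; scipy's `nbinom` follows the same "failures before the r-th success" convention as this article, and allows non-integer r):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

# Synthetic data with known parameters; in practice `data` is your counts.
rng = np.random.default_rng(0)
data = nbinom.rvs(3, 0.25, size=5_000, random_state=rng)

def neg_log_lik(params):
    r, p = params
    if r <= 0 or not 0.0 < p < 1.0:
        return np.inf   # reject invalid parameter values
    return -nbinom.logpmf(data, r, p).sum()

res = minimize(neg_log_lik, x0=[1.0, 0.5], method="Nelder-Mead")
r_hat, p_hat = res.x
print(r_hat, p_hat)
```

The estimates should land near the true (3, 0.25); Nelder-Mead is used because the likelihood surface is smooth but we avoid hand-coding gradients.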

Also, it is possible to do Negative Binomial regression, modeling the effects of covariates. We’ll save that for a future article.

