
Why is everything based on likelihoods even though likelihoods are so small?

source link: https://stats.stackexchange.com/questions/639548/why-is-everything-based-on-likelihoods-even-though-likelihoods-are-so-small


Suppose I generate some random numbers from a specific normal distribution in R:

set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

These numbers look like this:

 [1]  2.1976218  3.8491126 12.7935416  5.3525420  5.6464387 13.5753249  7.3045810 -1.3253062
 [9]  1.5657357  2.7716901 11.1204090  6.7990691  7.0038573  5.5534136  2.2207943 13.9345657
[17]  7.4892524 -4.8330858  8.5067795  2.6360430 -0.3391185  3.9101254 -0.1300222  1.3555439
[25]  1.8748037 -3.4334666  9.1889352  5.7668656 -0.6906847 11.2690746  7.1323211  3.5246426
[33]  9.4756283  9.3906674  9.1079054  8.4432013  7.7695883  4.6904414  3.4701867  3.0976450
[41]  1.5264651  3.9604136 -1.3269818 15.8447798 11.0398100 -0.6155429  2.9855758  2.6667232
[49]  8.8998256  4.5831547

Now, suppose I calculate the likelihood of these numbers under the correct normal distribution:

> likelihood <- prod(dnorm(random_numbers, mean = 5, sd = 5))
> likelihood
[1] 9.183016e-65

As we can see, even from the correct distribution, the likelihood is very, very small. Thus, it appears to be very unlikely in a certain sense that these numbers came from the very distribution they were generated from.

The only consolation is that the likelihood is even smaller under some other distribution, e.g.

> likelihood <- prod(dnorm(random_numbers, mean = 6, sd = 6))
> likelihood
[1] 3.954015e-66

But this to me seems like a moot point: a turtle is faster than a snail, but both animals are slow. Even though the likelihood under the correct parameters (mean 5, sd 5) is bigger than under the incorrect ones (mean 6, sd 6), both are still so small!

So how come in statistics, everything is based on likelihoods (e.g. regression estimates, maximum likelihood estimation) when the evaluated likelihood is always so small, even for the correct distribution?

asked 13 hours ago

3 Answers

The key lies not in the absolute size of the likelihood values but in their relative comparison and the mathematical principles underlying likelihood-based methods. The smallness of the likelihood is expected when dealing with continuous distributions and a product of many probabilities because you're essentially multiplying a lot of numbers that are less than 1.
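To make that scaling concrete (a quick illustration, not part of the original answer), you can watch the product of densities shrink as the sample size grows:

set.seed(123)
x <- rnorm(1000, mean = 5, sd = 5)

# Each observation contributes a density value well below 1 here, so the
# product shrinks roughly exponentially with n (and eventually underflows
# to 0 in double precision).
sapply(c(10, 50, 200, 1000),
       function(n) prod(dnorm(x[1:n], mean = 5, sd = 5)))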

The utility of likelihoods comes from their comparative nature, not their absolute values. When we compare likelihoods across different sets of parameters, we're looking for which parameters make the observed data "most likely" relative to other parameter sets, rather than looking for a likelihood that suggests the data is likely in an absolute sense.

The scale of likelihood values is often less important than how these values change relative to changes in parameters. This is why in many statistical methods, such as MLE, we're interested in finding the parameters that maximize the likelihood function, as these are considered the best estimates given the data.
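For instance (a minimal sketch, not from the original answer), the maximum-likelihood estimates can be recovered numerically by minimizing the negative log-likelihood with optim(), parameterizing the sd on the log scale so it stays positive:

set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

# Negative log-likelihood; par = c(mean, log(sd))
neg_log_lik <- function(par) {
  -sum(dnorm(random_numbers, mean = par[1], sd = exp(par[2]), log = TRUE))
}

fit <- optim(par = c(0, 0), fn = neg_log_lik)
fit$par[1]        # MLE of the mean, close to the true value 5
exp(fit$par[2])   # MLE of the sd, close to the true value 5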

Because likelihood values can be extremely small, in practice, statisticians often work with the log of the likelihood. This transformation turns products into sums, making the values more manageable and the optimization problems easier to solve, while preserving the location of the maximum.

set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

# Function to calculate log likelihood of a normal distribution
log_likelihood <- function(data, mean, sd) {
  sum(dnorm(data, mean, sd, log = TRUE))
}

# Calculating log likelihood for the correct parameters
log_likelihood_correct <- log_likelihood(random_numbers, 5, 5)
print(log_likelihood_correct)
[1] -147.4507

# Calculating log likelihood for incorrect parameters
log_likelihood_incorrect <- log_likelihood(random_numbers, 6, 6)
print(log_likelihood_incorrect)
[1] -150.5959

# Comparison
print(log_likelihood_correct > log_likelihood_incorrect)
[1] TRUE
answered 12 hours ago

First, as others have mentioned, we usually work with the logarithm of the likelihood function, for various mathematical and computational reasons.

Second, since the likelihood function depends on the data, it is convenient to transform it to a function with standardized maxima (see Pickles 1986).

$$R(\theta) = \frac{L(\theta)}{L(\theta^*)}, \quad \text{where } \theta^* = \arg\max_{\theta} L(\theta)$$
set.seed(123)
random_numbers <- rnorm(50, mean = 5, sd = 5)

# Likelihood at the true parameters (mean 5, sd 5)
max_likelihood <- prod(dnorm(random_numbers, mean = 5, sd = 5))

# Likelihood over a grid of candidate means, holding the sd fixed at 5
means <- seq(0, 10, length.out = 1000)
nonmax_likelihood <- sapply(means, function(k) prod(dnorm(random_numbers, mean = k, sd = 5)))

par(mfrow = c(1, 2))

plot(means, nonmax_likelihood / max_likelihood,
     xlab = "Mean", ylab = "Relative likelihood")

plot(means, log(nonmax_likelihood) - log(max_likelihood),
     xlab = "Mean", ylab = "Relative log-likelihood")

[Figure: relative likelihood (left) and relative log-likelihood (right) as functions of the mean, both peaking near the sample mean]
answered 10 hours ago

I can think of two things that might help you.

First, likelihoods are defined only up to a proportionality factor, and their utility comes from their use in ratios; while they are proportional to the relevant probability, they are not probabilities. That means that if you are uncomfortable with values in the range of $10^{-65}$, you could simply multiply them all by $10^{65}$ without changing the ratios. Of course, there is no need to do so, as the ratio effectively does it for you. The likelihood ratio for the two distributions is about 23 in favour of the (5, 5) distribution over the (6, 6) distribution. That would typically be thought of as fairly strong (but not overwhelmingly strong) support by the data (and the statistical model) for the (5, 5) distribution over the (6, 6) distribution.
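As a quick check (a hypothetical snippet, not in the original answer, reusing random_numbers from the question):

lr <- prod(dnorm(random_numbers, mean = 5, sd = 5)) /
      prod(dnorm(random_numbers, mean = 6, sd = 6))
lr   # roughly 23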

Second, I usually find a plot of the likelihood as a function of a parameter to be helpful. You have set up the system with two parameters that are effectively 'of interest', so the relevant likelihood function would be three-dimensional and thus awkward (those dimensions being the population mean, the standard deviation, and the likelihood values). It would be easier for you to fix one of those parameters and explore the likelihood as a function of the other, as in the sketch below. My justification for looking at the full likelihood function rather than a single ratio of two selected points in parameter space is that it contains more information and it allows the data to speak with less distortion.
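A minimal sketch of that suggestion (my illustration, again assuming the random_numbers from the question): fix the mean at 5 and trace the log-likelihood as the sd varies.

# Log-likelihood over a grid of candidate sds, holding the mean fixed at 5
sds <- seq(2, 12, length.out = 500)
ll  <- sapply(sds, function(s) sum(dnorm(random_numbers, mean = 5, sd = s, log = TRUE)))

plot(sds, ll - max(ll), type = "l",
     xlab = "Standard deviation (mean fixed at 5)",
     ylab = "Relative log-likelihood")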

answered 12 hours ago