Maximum Likelihood Demystified

In this post, we are going to examine how Maximum Likelihood Estimators work, how to make one of our own and how to use it.

What is Maximum Likelihood Estimation?

It is a statistical technique for choosing, from a family of candidate models, the one that maximizes the probability of occurrence of a given observation sequence. The maximum likelihood principle states that the desired probability distribution is the one that makes the observed data 'most likely'. For example, suppose we have two models A and B, and a given observation sequence D. Further, suppose that under model A the probability of observing the sequence D is 0.7, while under B it is 0.8. Then, by the ML Principle, we select the model under which the observation sequence is more likely, that is, we use model B.

(Figure: Visualization of MLE in a coin tossing experiment)

More formally, let \(\mathbb{D}\) be an observation sequence consisting of '\(n\)' observations. We seek a probabilistic model that best explains the observation sequence. Thus, if we have a family of models specified by a model parameter vector \(\underline{\theta} = {(\theta_1, \theta_2 \ldots \theta_k)}^T\), where \(k\) is a positive integer, then we seek to find a model with parameters \(\underline{\hat{\theta}}\) which maximize the likelihood of the observation sequence \(\mathbb{D}\).

Therefore, the Maximum Likelihood Objective may be stated as:

\(\begin{equation} \label{eq:1} \arg\max_{\underline{\theta}} P(\mathbb{D}|\underline{\theta}) \end{equation}\)

Here, \(P(\mathbb{D}|\underline{\theta})\) is the probability of occurrence of the sequence \(\mathbb{D}\) given that the model with parameters \(\underline{\theta}\) is chosen as the driving model.
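As a quick, concrete illustration, here is a minimal Python sketch that approximates this argmax by brute force: evaluate the likelihood on a grid of candidate parameter values and keep the best one. The `likelihood` function here is just a hypothetical stand-in (a one-parameter coin model, anticipating the examples below), and NumPy is assumed to be available:

```python
import numpy as np

def likelihood(theta, D):
    # Hypothetical stand-in for P(D | theta): a one-parameter coin model
    # (D holds 1 for Heads, 0 for Tails), anticipating the example below.
    return np.prod(theta ** D * (1.0 - theta) ** (1 - D))

D = np.array([1, 1, 0])                 # an example observation sequence
grid = np.linspace(0.01, 0.99, 99)      # candidate values of the single parameter theta

# Maximum Likelihood: keep the candidate that makes D most likely.
theta_hat = grid[np.argmax([likelihood(t, D) for t in grid])]
print(theta_hat)                        # 0.67, the grid point closest to 2/3
```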

To better understand this, let us take a few simple but more concrete examples:

In a coin tossing experiment, a coin was tossed '\(n\)' times, and the following sequences were observed:

\(n=1, \mathbb{D}=\{H\}\) where H stands for Heads and T for Tails.

\(n=3, \mathbb{D}=\{H,H,T\}\)

We wish to find a model that best explains these observation sequences.

So, first we need to understand the aspects of the experiment. A coin has 2 faces, one is called 'Heads' while the other is called 'Tails'. Tossing the coin yields only one of these two outcomes. If you were seeing a coin for the first time, like, if you were some alien landing on earth, you would not know what outcome to expect. Because you know absolutely nothing about the coin, you conduct the experiment of actually tossing it! So, for a single coin toss, the sample space is just these two outcomes, \(\mathbb{S} = \{H,T\}\).

As an aside (and as an earthling), we know that coins can have varied behaviour. By virtue of its construction, a coin can be heavier on one side (a loaded coin), in which case we say that the coin has a bias towards one particular outcome. As students of science, we also know that the coin flip and its outcome are governed by physical laws. So, we might set up an elaborate experiment to study the angle of the initial hold, flip velocity, wind velocity, air pressure and other parameters, and accurately determine the outcome. But that would be much ado about nothing. It is just a coin flip! Instead, we just want to know, in a 'general sense of things', 'If I flip this coin, what are the chances that I get a Heads?'. We neatly tuck all these things under the bias umbrella and try to give our answer based on that one parameter (for simplicity's sake).

Clearly, the alien cannot answer that! He doesn't know anything about the coin. So, we'd like to see what notion of the bias he arrives at. What he does know is Maximum Likelihood Estimation. So, here is how he proceeds:

Step 1: Assume a model that we believe to best encapsulate the observations

This involves setting up a probability density function (strictly speaking, a probability mass function here, since the outcomes are discrete, but we will stick with the usual p.d.f. shorthand). For this simple case, we believe that only the bias affects the outcome. So, let \(\theta\) be the bias in the coin, such that \(P\{H|\theta\} = \theta\). Since the probabilities of all disjoint events in a sample space must sum to 1, we have \(P\{T|\theta\} = 1-\theta\).

Now, our p.d.f. should encapsulate all the information and parameters. Let \(x\) denote our random variable which takes a value for each outcome. For mathematical convenience, let:

\(x = 1 \text{ on Heads (success)}\\ x = 0 \text{ on Tails (failure)}\)

So, our p.d.f. may be given as \(f(x|\theta) = \theta^x(1-\theta)^{1-x}\)

We could have taken different assignments for our random variable, but we chose this particular form because it has one simplifying property: one of the two factors, \(\theta^x\) or \((1-\theta)^{1-x}\), reduces to 1 depending on whether the outcome is Heads or Tails. So, it neatly simplifies to \(P\{H|\theta\} = \theta\) for our case.
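In code, this density is a one-liner. Here is a minimal sketch (the function name `coin_pdf` is just for illustration, not any standard API):

```python
def coin_pdf(x, theta):
    # f(x | theta) = theta^x * (1 - theta)^(1 - x), with x = 1 for Heads, x = 0 for Tails
    return theta ** x * (1 - theta) ** (1 - x)

print(coin_pdf(1, 0.6))   # 0.6 -> P(H | theta) = theta
print(coin_pdf(0, 0.6))   # 0.4 -> P(T | theta) = 1 - theta
```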

Step 2: Define a Likelihood function

Note that the p.d.f. was defined for a single observation only. For the entire observation sequence, a joint p.d.f. must be defined. However, we believe that previous outcomes do not affect future outcomes. Consequently, we can assume the independence of each observation. Then, the Likelihood function is defined as:

\(\begin{equation} \mathcal{L}(\theta|\mathbb{D}) = P(\mathbb{D}|\theta) = P\{x_1,x_2,x_3\ldots,x_n | \theta\} \\ = P(x_1|\theta)P(x_2|\theta)\ldots P(x_n|\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{(1-x_i)} \end{equation}\)

Note that \(\mathcal{L}(\theta|\mathbb{D})\) is usually not a probability density function in \(\theta\): its integral over \(\theta\) need not equal 1, and may not even be finite! [1][2]
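Translated directly into code, the likelihood is just the product of the per-observation densities. A minimal sketch under the independence assumption above (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def likelihood(theta, D):
    # L(theta | D) = product over i of theta^{x_i} * (1 - theta)^{1 - x_i}
    D = np.asarray(D)
    return np.prod(theta ** D * (1 - theta) ** (1 - D))

D = [1, 1, 0]                 # the observed sequence {H, H, T}
print(likelihood(0.5, D))     # 0.125
print(likelihood(2/3, D))     # ~0.148 -- larger, so theta = 2/3 explains D better than 0.5
```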

Step 3: Maximize the Likelihood function

One may proceed as one likes to maximize this objective function. We know that a product is easily converted to a sum by taking its logarithm, and since the logarithm is a monotonically increasing function, maximizing the log likelihood yields the same answer as maximizing the likelihood itself [3].

Log likelihood may be written as:

\(\ln P(\mathbb{D}|\theta) = \sum_{i=1}^{n}\left\{x_i\ln(\theta) + (1-x_i)\ln(1-\theta)\right\}\)

Now, taking the derivative and setting it to 0 gives us our candidates for a relative maximum or minimum, and we may determine its nature from the second derivative [4]. It is easily shown that the above expression is maximized for:

\(\begin{equation} \label{eq:2} \hat{\theta} = \frac{\sum_{i=1}^{n}{x_i}}{n} \end{equation}\)
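For those who want to see the omitted step: differentiating the log likelihood with respect to \(\theta\) and setting the result to zero gives

\(\frac{\partial}{\partial\theta}\ln P(\mathbb{D}|\theta) = \frac{\sum_{i=1}^{n}x_i}{\theta} - \frac{n - \sum_{i=1}^{n}x_i}{1-\theta} = 0 \implies \hat{\theta} = \frac{\sum_{i=1}^{n}x_i}{n}\)

and the second derivative, \(-\frac{\sum_{i=1}^{n}x_i}{\theta^2} - \frac{n - \sum_{i=1}^{n}x_i}{(1-\theta)^2}\), is negative on \((0,1)\), confirming that this is indeed a maximum.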

We can also arrive at the same result directly from our two simple examples.

For instance, in part 1, \(\mathcal{L}(\theta|\mathbb{D}) = P(H|\theta) = \theta\). This is maximized for \(\theta = 1\). If we used the generalized formula above, we would obtain the same result: \(\hat{\theta} = 1/1 = 1\).

Then in part 2, \(\mathcal{L}(\theta|\mathbb{D}) = P(H,H,T|\theta) = \theta^2(1-\theta)\). This is maximized for \(\theta = 2/3\), which agrees with the formula above: \(\hat{\theta} = 2/3\).
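To double-check the arithmetic, here is a small sketch that compares the closed-form estimate \(\sum_{i} x_i / n\) with a numerical maximizer of the log likelihood for both observation sequences. This assumes NumPy and SciPy are available; note that for \(\mathbb{D}=\{H\}\) the maximum sits at the boundary \(\theta = 1\), so the bounded optimizer stops just short of it:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(theta, x):
    # Negative Bernoulli log likelihood; minimizing this maximizes the likelihood.
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

for x in (np.array([1]), np.array([1, 1, 0])):      # D = {H} and D = {H, H, T}
    closed_form = x.sum() / len(x)                  # theta_hat = sum(x_i) / n
    numeric = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                              args=(x,), method='bounded').x
    print(closed_form, round(numeric, 4))           # 1.0 ~1.0, then 0.666... 0.6667
```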

From the alien's point of view, the maths is right. But from our earthling perspective, don't you think the guesses are a bit too extreme? In particular, if a string of Heads is observed, we are more inclined to call it a run of good luck than to proclaim that the coin will always yield Heads. We like to keep some margin for error. We would also like to believe that the world is usually a fair place, and that coins are usually fair, so we would heavily down-weight the possibility of the coin being totally loaded. In such cases, we would like to codify our prior beliefs about the coin in the problem formulation itself. This is where we start to depart from MLE towards a Bayesian approach, namely Maximum A Posteriori (MAP) estimation, and it points to one of the shortcomings of MLE.

Here is another shortcoming. Take our first example again. Let us add one more observation to \(\mathbb{D}\) and re-estimate the probabilities with models A and B. Suppose A now states the probability to be 0.72, while B's has dropped to 0.71. A change of a single observation would make us switch models. That seems counter-intuitive, no?

However, this does not mean that MLE is not useful. It is used extensively in our mobile phones to decide which code was transmitted by the base station tower. We'll cover that sometime later.

P.S.: Here's an interesting bit of trivia: You Can Load a Die, But You Can’t Bias a Coin

[1] Wikipedia Talk: Likelihood function

[2] What is the reason that a likelihood function is not a pdf?

[3] C. M. Bishop, "Pattern Recognition and Machine Learning"

[4] In Jae Myung, "Tutorial on maximum likelihood estimation"