In the previous post, we explored the concept of Maximum Likelihood Estimation (MLE) and its intuitive foundation. Now, let's delve deeper into the mathematics of MLE and understand how it can be applied to different probability distributions.
Maximum Likelihood Estimation (MLE) is a method for estimating the parameter(s) of a probability distribution by maximizing the likelihood function. Given a random sample $X_1, X_2, \ldots, X_n$ drawn from a distribution $F(\cdot \mid \theta)$ with probability density function (pdf) or probability mass function (pmf) $f(\cdot \mid \theta)$, MLE finds the value of $\theta$ that makes the observed sample most probable. Note that $\theta$ can be a vector of parameters: for example, N$(\mu, \sigma^2)$ has $\theta = (\mu, \sigma^2)$, while in the Bernoulli case $\theta = p$.
The goal is to find a quantity, as a function of the parameter $\theta$, that measures how well the distribution with that particular $\theta$ agrees with the observed sample. In the previous post, for the Bernoulli$(p)$ case, the function $L(p) = \binom{8}{5} \cdot p^5 \cdot (1-p)^3$ was that quantity. If you observe, $L(p)$ is nothing but the joint distribution of the Bernoulli random variables evaluated at the observed values (up to the constant $\binom{8}{5}$, which does not depend on $p$ and hence does not affect the maximization), and this joint distribution is the product of the individual pmfs due to the i.i.d. nature of the random sample. Throughout, $f_{X_i}(x_i \mid \theta)$ denotes the pmf/pdf of $X_i$ with parameter $\theta$, evaluated at the realization $x_i$.
Now, as you have observed, the likelihood of a parameter value $\theta$ should be defined as the joint distribution of the random sample, each member of which follows $f(\cdot \mid \theta)$, evaluated at the observed values. This gives the chance of observing what was actually observed, and that is exactly what we want to maximize.
Therefore, we define the likelihood $L(\theta \mid X = x)$ as $\prod_{i = 1}^{n} f_{X_i}(x_i \mid \theta)$. To maximize the likelihood, we would typically find its derivative and equate it to 0, but differentiating a product of $n$ terms is cumbersome. So, we take the logarithm of $L(\theta)$ and optimize that instead, since the product turns into a summation. The optimizer, i.e., the point where the maximum is attained, doesn't change after taking the logarithm. We will write $L(\theta \mid X = x)$ as $L(\theta)$ in this post.
This works because $\log(x)$ is a strictly increasing (and bijective) function from $(0,\infty) \rightarrow \mathbf{R}$. It is easy to show the following result.
Proof
The idea of the proof is simple. For a strictly increasing function $f(\cdot)$ and any relation $R \in \{ \geq, \leq, = \}$, we have $g(x) \; R \; g(y)$ if and only if $f(g(x)) \; R \; f(g(y))$. The forward direction follows from monotonicity; for the reverse direction, apply the (also strictly increasing) inverse $f^{-1}(\cdot)$.
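In our setting, take $f(\cdot) = \log(\cdot)$ and $g(\cdot) = L(\cdot)$; with $R$ being $\geq$, the statement we need reads:
$$L(\hat{\theta}) \geq L(\theta) \ \text{ for all } \theta \iff \log L(\hat{\theta}) \geq \log L(\theta) \ \text{ for all } \theta,$$
so $\hat{\theta}$ maximizes the likelihood if and only if it maximizes the log-likelihood.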

Step 1: Write down the likelihood function $L(\theta)$, which is the joint probability distribution of the observed sample.
Step 2: Take the natural logarithm of the likelihood function, yielding the log-likelihood function $l(\theta) = \log(L(\theta)) = \log\left(\prod_{i = 1}^{n}f_{X_i}(x_i \mid \theta)\right)$, which gives rise to $\sum_{i = 1}^{n} \log(f_{X_i}(x_i \mid \theta))$.
Step 3: Find the derivative of the log-likelihood function with respect to $\theta$ and set it equal to zero to obtain the maximum likelihood estimator $\hat{\theta}$.
Step 4: Solve for $\hat{\theta}$, either analytically or numerically, depending on the complexity of the likelihood function.
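When Step 4 cannot be done analytically, the same recipe can be carried out numerically by minimizing the negative log-likelihood. Below is a minimal sketch assuming a single scalar parameter; the helper name `numerical_mle_1d` and the choice of `scipy.optimize.minimize_scalar` are just illustrative, not part of the general procedure:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def numerical_mle_1d(log_pdf, data, bounds):
    """Numerically carry out Steps 1-4 for a single scalar parameter theta.

    log_pdf(x, theta) should return log f(x | theta) for one observation x.
    bounds is an interval (lower, upper) known to contain the MLE.
    """
    # Steps 1 and 2: the log-likelihood is the sum of log f(x_i | theta).
    def neg_log_likelihood(theta):
        return -np.sum([log_pdf(x, theta) for x in data])

    # Steps 3 and 4: maximizing l(theta) is the same as minimizing -l(theta).
    result = minimize_scalar(neg_log_likelihood, bounds=bounds, method="bounded")
    return result.x

# Illustrative usage: Bernoulli(p) data with 5 successes in 8 trials,
# so the numerical answer should be close to the sample mean 0.625.
data = [1, 0, 1, 1, 0, 1, 1, 0]
bernoulli_log_pmf = lambda x, p: x * np.log(p) + (1 - x) * np.log(1 - p)
print(numerical_mle_1d(bernoulli_log_pmf, data, bounds=(1e-6, 1 - 1e-6)))
```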
Let's apply the MLE algorithm to the Bernoulli distribution. We have a sample of $n$ Bernoulli trials ($X_1, X_2, \ldots, X_n$) with the realizations ($x_1, x_2, \ldots, x_n$), where each trial results in success (1) with probability $p$ and failure (0) with probability $1-p$. The likelihood function is:
$$ L(p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} $$
Taking the natural logarithm, we get:
$$l(p) = \sum_{i=1}^{n} \left(x_i \ln(p) + (1-x_i)\ln(1-p)\right)$$
Differentiating and setting $\frac{dl}{dp} = 0$, we find:
$$\frac{1}{p} \sum_{i=1}^{n} x_i - \frac{1}{1-p} \sum_{i=1}^{n} (1-x_i) = 0$$
Solving for $\hat{p}$:
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Since $\frac{1}{n} \sum_{i=1}^{n} x_i$ is the proportion of $1$s among the $n$ trials, this is consistent with the idea that a good estimate of the probability of success is the observed proportion of successes.
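As a quick sanity check, and to see concretely that taking the logarithm does not move the maximizer, here is a small grid-search sketch in Python using the 5-successes-in-8-trials sample from the previous post (the grid resolution is an arbitrary illustrative choice):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # 5 successes in 8 trials, as in the previous post
p_grid = np.linspace(0.001, 0.999, 999)

# Likelihood L(p) as a product, and log-likelihood l(p) as a sum, over the same grid.
likelihood = np.array([np.prod(p ** x * (1 - p) ** (1 - x)) for p in p_grid])
log_likelihood = np.array([np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) for p in p_grid])

# Both curves peak at the same grid point, which matches the closed-form MLE p_hat = mean(x).
print(p_grid[likelihood.argmax()], p_grid[log_likelihood.argmax()], x.mean())
```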
Consider a sample of $n$ observations ($X_1, X_2, \ldots, X_n$) with the realizations ($x_1, x_2, \ldots, x_n$) from a normal distribution with an unknown mean $\mu$ and known variance $\sigma^2=1$. The likelihood function is:
$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x_i-\mu)^2} $$
Taking the natural logarithm, we get:
$$l(\mu) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n} (x_i-\mu)^2$$
Differentiating and setting $\frac{dl}{d\mu} = 0$, we find:
$$\sum_{i=1}^{n}(x_i - \mu) = 0$$
Solving for $\hat{\mu}$:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i$$
This is consistent with our intuition that the sample mean is the maximum likelihood estimator for the unknown mean of a normal distribution with known variance.
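As a quick numerical cross-check (a self-contained sketch; the simulated data, grid range, and seed are arbitrary choices for illustration), a grid search over $\mu$ lands on the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # simulated N(mu = 2, sigma^2 = 1) data

def log_likelihood(mu):
    # l(mu) = -(n/2) log(2 pi) - (1/2) sum (x_i - mu)^2
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

mu_grid = np.linspace(0.0, 4.0, 4001)
mu_hat_numeric = mu_grid[np.argmax([log_likelihood(m) for m in mu_grid])]

# The two values should agree up to the grid resolution of 0.001.
print(mu_hat_numeric, x.mean())
```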
The Maximum Likelihood Estimation procedure provides a powerful framework for estimating unknown parameters in various statistical models, making it a fundamental tool in statistics and data analysis.
The negative log-likelihood $-\log(L(\theta))$, or rather $-2\log(L(\theta))$, is used as a loss function to be minimized in most probabilistic supervised learning methods in order to find the best parameter values; this is equivalent to maximizing the log-likelihood function. The quantity $-2\log(L(\theta))$ is quite special in the asymptotic sense: the likelihood-ratio statistic built from it follows a sweet distribution (a chi-squared distribution, by Wilks' theorem), which helps a lot in hypothesis testing.
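To make the last claim concrete: under the null hypothesis, the statistic $-2\left(\log L(\theta_0) - \log L(\hat{\theta})\right)$ is approximately chi-squared distributed for large samples. Here is a rough simulation sketch for the Bernoulli example, assuming a true null value $p_0 = 0.5$; the sample size and number of repetitions are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p0, reps = 200, 0.5, 5000

def log_lik(x, p):
    # Bernoulli log-likelihood: sum of x_i log(p) + (1 - x_i) log(1 - p).
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

stats = []
for _ in range(reps):
    x = rng.binomial(1, p0, size=n)   # sample generated under the null hypothesis p = p0
    p_hat = x.mean()                  # unrestricted MLE from the derivation above
    # Likelihood-ratio statistic: -2 * (log L(p0) - log L(p_hat)).
    stats.append(-2 * (log_lik(x, p0) - log_lik(x, p_hat)))

# Asymptotically this statistic is chi-squared with 1 degree of freedom, whose 95th
# percentile is about 3.84, so the empirical exceedance rate should be near 5%.
print(np.mean(np.array(stats) > 3.84))
```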
In the upcoming posts, we will solve for the MLE in various other examples and also look into special cases where finding the optimum is not as simple as taking the derivative and solving the resulting equations. Sometimes the equations cannot be solved analytically and we have to resort to iterative methods to find the optimum, and sometimes the domain of optimization is tricky and has to be handled carefully. Also, in certain cases, one has to make sure the function is concave so that the stationary point found is indeed a maximum.
Exercise: Show that for the Bernoulli case and the Normal with unknown mean case, the log-likelihood functions are concave.