Estimators and Point Estimation

This section investigates methods for estimating values for model parameters using sample observations. Specifically, a statistic is referred to as an estimator if we use it as a value for $\theta$; we shall denote this estimator as $\hat{\theta}$. It is important to note here that $\hat{\theta}$ is a random variable.

Bias-Variance Decomposition

At this point, we are going to digress a bit and discuss some important properties of point estimators. First, we define the Mean Squared Error (MSE) of an estimator $\hat{\theta}$ of a parameter $\theta$ to be:

$$\operatorname{MSE}(\hat{\theta})=E\left[(\hat{\theta}-\theta)^{2}\right]$$

Starting from the definition of MSE, we can add and subtract $E[\hat{\theta}]$ inside the square and expand; the cross term vanishes because $E\big[\hat{\theta}-E[\hat{\theta}]\big]=0$, which yields a useful decomposition:

$$\begin{aligned} \operatorname{MSE}(\hat{\theta}) &= \left(E\left[\hat{\theta}^{2}\right]-E[\hat{\theta}]^{2}\right)+(E[\hat{\theta}]-\theta)^{2} \\ &= \operatorname{Var}(\hat{\theta})+\operatorname{Bias}(\hat{\theta})^{2} \end{aligned}$$

We can interpret each of the terms as follows:

$$\begin{aligned} \operatorname{Bias}(\hat{\theta}) &= E[\hat{\theta}]-\theta = 0 \text{ if the estimator is 'unbiased'} \\ \operatorname{Var}(\hat{\theta}) &= \text{how sensitive the estimate is to the randomness inherent in the data} \end{aligned}$$

An estimator $\hat{\theta}$ of a parameter $\theta$ is said to be consistent if

$$\operatorname{MSE}(\hat{\theta}) \rightarrow 0 \text{ as } n \rightarrow \infty$$

A consistent estimator converges to the parameter as the sample size increases.
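
As a quick illustration of this decomposition (a sketch of my own, not part of the original derivation), the following Python snippet estimates the bias, variance, and MSE of the sample mean by Monte Carlo simulation; the population parameters, the sample size $n$, and the number of trials are arbitrary choices made for the demonstration.

```python
import numpy as np

# Monte Carlo check that MSE(theta_hat) = Var(theta_hat) + Bias(theta_hat)^2,
# using the sample mean as an estimator of the population mean mu.
# mu, sigma, n and n_trials are arbitrary values chosen for illustration.
rng = np.random.default_rng(0)
mu, sigma, n, n_trials = 2.0, 3.0, 50, 100_000

# Draw n_trials independent samples of size n and compute the estimator on each.
estimates = rng.normal(mu, sigma, size=(n_trials, n)).mean(axis=1)

bias = estimates.mean() - mu
variance = estimates.var()
mse = np.mean((estimates - mu) ** 2)

print(f"Bias^2 + Var = {bias**2 + variance:.5f}")
print(f"MSE          = {mse:.5f}")  # the two agree up to Monte Carlo error
```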

Maximum-Likelihood Estimation (MLE)

Maximum-Likelihood Estimation, or 'MLE' for short, is an extremely important technique in probabilistic modelling as it provides a framework for estimating parameters for any model from a given sample $D$. The main idea behind this approach is that even though we cannot recover the true population parameter exactly (because we cannot sample the whole population), the best we can do is to find the parameter value under which the observed sample is most likely to have been generated. Given some sample dataset $D=\left\{D_{1}, D_{2}, \ldots, D_{n}\right\}$, we can define a likelihood function $L(\theta)$ as the probability that the sample data $D$ was generated given some value of the model parameter $\theta$:

$$L(\theta)=P(D \mid \theta)$$

Since we assume our samples $D$ are independent and identically distributed (i.i.d.), this greatly simplifies to:

$$L(\theta)=P(D \mid \theta)=\prod_{i=1}^{n} f_{X}\left(D_{i} \mid \theta\right)$$

We define the Maximum-Likelihood Estimate (MLE) of a parameter $\theta$ to be $\hat{\theta}_{\mathrm{MLE}}$, given by:

$$\hat{\theta}_{\mathrm{MLE}}=\underset{\theta}{\arg \max }\, L(\theta)$$

Note: In order to be succinct, we will use $\hat{\theta}$ to refer to $\hat{\theta}_{\mathrm{MLE}}$ throughout this section.

Obtaining $\hat{\theta}$ is often a straightforward process that involves writing down the likelihood function $L(\theta)$ and then finding the value for $\theta$ that maximises it. It is also often beneficial to verify that the estimate is indeed a maximum. In this course we will only be concerned with likelihood functions that can be directly maximised using methods from calculus; however, we note that more advanced techniques (e.g. nonlinear optimisation) may be required to obtain MLE estimates in general.
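
To give a flavour of the nonlinear-optimisation route mentioned above (beyond what this course requires), here is a hedged sketch that maximises a log-likelihood numerically by minimising its negative with a general-purpose optimiser; the choice of a Gamma distribution, the 'true' parameter values, and the sample size are all assumptions made purely for this illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Sketch: when no closed-form MLE exists (e.g. the shape parameter of a Gamma
# distribution), maximise the log-likelihood numerically by minimising the
# negative log-likelihood. All parameter values here are arbitrary.
rng = np.random.default_rng(1)
true_shape, true_scale = 3.0, 2.0
D = rng.gamma(true_shape, true_scale, size=500)

def neg_log_likelihood(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:       # keep the search inside the valid range
        return np.inf
    return -np.sum(gamma.logpdf(D, a=shape, scale=scale))

result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
print("Numerical MLE (shape, scale):", result.x)
```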

Example: Bernoulli MLE

We will highlight the steps behind the process for obtaining a Maximum-Likelihood Estimate of $\theta$ using the Bernoulli model of flipping a coin; here the model parameter we are estimating is $\theta=p$ (i.e. the probability a coin flip is heads), as seen in Table 1.

  1. Writing Down the Likelihood Function: Here we seek to estimate the probability that flipping a coin comes up heads (model parameter $p$). Let's say we toss the coin twenty times ($n=20$) and produce the following sample $D$:

$$D=(H, H, T, T, T, H, H, T, T, T, T, H, T, T, H, T, T, T, H, H)$$

We know that each coin flip is a Bernoulli trial, with the probability of heads equal to $\theta=p$ (we shall casually refer to this model parameter as $\theta$ in the subsequent steps to illustrate the general process). Formally, we can express this as $f_{X}\left(D_{i}=H \mid \theta\right)=\theta$, where $D_{i}=H$ implies a 'heads' is observed in the $i^{\text{th}}$ coin flip in the sequence.

It is often beneficial to apply zero/one encoding to our Bernoulli RV. That is, use $D_{i}=1$ if the $i^{\text{th}}$ coin flip was heads and $D_{i}=0$ if it was tails. The complete probability mass function for the Bernoulli distribution can then be written as:

$$f_{X}\left(D_{i} \mid \theta\right)=\theta^{D_{i}}(1-\theta)^{1-D_{i}}$$

Since we perform each coin flip independently (i.e., i.i.d.), the total probability of observing $D$ (given that the true parameter of the coin is $\theta$) can be expressed as a product of each of the individual trials:

$$L(\theta)=P(D \mid \theta)=f_{X}\left(D_{1} \mid \theta\right) \times f_{X}\left(D_{2} \mid \theta\right) \times \ldots \times f_{X}\left(D_{n} \mid \theta\right)=\prod_{i=1}^{n} f_{X}\left(D_{i} \mid \theta\right)$$

Thus, the individual terms in the likelihood function are simply the PMF or PDF $f_{X}\left(D_{i} \mid \theta\right)$ under consideration evaluated at $D_{i}$. Note that this is always true for independent and identically distributed (i.i.d.) experiments by definition, and it simplifies the overall calculations significantly. After we substitute in our expression for $f_{X}\left(D_{i} \mid \theta\right)$ and rearrange the equation, the expression for the likelihood of $D$ given $\theta$ for this coin-flip example becomes:

$$\begin{aligned} L(\theta) &= \prod_{i=1}^{n} f_{X}\left(D_{i} \mid \theta\right)=\prod_{i=1}^{n} \theta^{D_{i}}(1-\theta)^{1-D_{i}} \\ &= \theta^{\sum_{i=1}^{n} D_{i}}(1-\theta)^{\sum_{i=1}^{n}\left(1-D_{i}\right)}=\theta^{N(H)}(1-\theta)^{N(T)} \end{aligned}$$

where $N(H)$ and $N(T)$ are the total number of heads and tails observed in $D$ respectively, and $n=N(H)+N(T)$.
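
As a small numerical sketch of this step (my own addition, not from the notes), we can encode the sample above as zeros and ones and evaluate $L(\theta)$ on a grid of candidate values; the grid resolution is an arbitrary choice.

```python
import numpy as np

# The n = 20 coin flips from the example, encoded as 1 = heads, 0 = tails.
D = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])
N_H, N_T = D.sum(), len(D) - D.sum()

# Evaluate L(theta) = theta^N(H) * (1 - theta)^N(T) on a grid of candidate values.
thetas = np.linspace(0.01, 0.99, 99)
likelihood = thetas**N_H * (1 - thetas)**N_T

print("theta maximising L on the grid:", thetas[np.argmax(likelihood)])  # approx. 0.4
```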

  2. Maximising the Likelihood Function: We can now optimise the expression for $L(\theta)$ to obtain an MLE for $\theta$. In practice, however, it is advantageous to optimise the logarithm of this expression instead, i.e. the so-called log-likelihood function, $\mathcal{L}(\theta)=\log L(\theta)$, as it transforms the series of multiplications into a summation, which results in a much easier function to optimise:
$$\mathcal{L}(\theta)=\log L(\theta)=N(H) \log (\theta)+N(T) \log (1-\theta)$$

From calculus we know that a differentiable function can only attain an interior minimum, maximum, or saddle point where its derivative is zero. We can use this to find the maximum of the log-likelihood function $\mathcal{L}(\theta)$. However, we need to note that our maximisation is constrained in this example by the fact that $\theta \in[0,1]$.

We can perform the maximisation to obtain the MLE in the coin flip case as follows:

$$\begin{aligned} \frac{d \mathcal{L}}{d \theta} &= 0 \\ \frac{N(H)}{\theta}-\frac{N(T)}{1-\theta} &= 0 \\ (1-\theta) N(H)-\theta N(T) &= 0 \quad (\text{assuming } \theta \neq 0,\ \theta \neq 1) \\ \theta(N(H)+N(T)) &= N(H) \end{aligned}$$

Finally, we get that:

$$\hat{\theta}=\frac{N(H)}{N(H)+N(T)}=\frac{N(H)}{n}=\frac{1}{n} \sum_{i} D_{i}$$

That is, our maximum-likelihood estimate for the probability of the coin coming up heads is equal to the total number of heads observed in our sample data divided by the sample size $n$, which is a relatively intuitive way to estimate this probability. For the particular sample $D$ in this example, we find that $\hat{\theta}=\frac{8}{20}=0.4$. We should also note that the last expression above is the sample mean, which makes sense since $E[X]=p=\theta$ for the Bernoulli distribution, and this estimate improves with increasing sample size $n$ (see Figure 29).
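
For completeness, here is a short Python sketch (the 'true' coin probability and the repeated sample sizes below are arbitrary assumptions for this demo) showing the closed-form MLE on the example data, and how the estimate tightens around the true parameter as $n$ grows:

```python
import numpy as np

# The closed-form Bernoulli MLE is simply the sample mean of the 0/1-encoded flips.
D = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1])
print("MLE for the n = 20 example:", D.mean())  # 0.4

# A small simulation (true p = 0.5 chosen arbitrarily) suggesting how the
# estimate concentrates around the true parameter as the sample size n grows.
rng = np.random.default_rng(2)
p_true = 0.5
for n in (20, 200, 2000, 20000):
    flips = rng.binomial(1, p_true, size=n)
    print(f"n = {n:5d}: theta_hat = {flips.mean():.3f}")
```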

  3. Verifying the Solution: We can further use calculus to verify that the solution we obtained is indeed a maximum by computing the second derivative of the function. If the parameter $\hat{\theta}$ is a local maximum, then the second derivative of $\mathcal{L}$ at that point will be negative. We therefore want to verify:
$$\mathcal{L}^{\prime \prime}(\theta=\hat{\theta})<0$$

This is true in our case, as $\mathcal{L}^{\prime \prime}(\hat{\theta})=-\frac{N(H)}{\hat{\theta}^{2}}-\frac{N(T)}{(1-\hat{\theta})^{2}}$ is always negative for any $\hat{\theta} \in(0,1)$.

Thus, the procedure for computing the maximum-likelihood estimate of a model parameter can be summarised as:

Maximum-Likelihood Estimation (MLE) procedure for an i.i.d. sample $D$:

  1. Write down the likelihood function $L(\theta)=\prod_{i=1}^{n} f_{X}\left(D_{i} \mid \theta\right)$, and take the logarithm of this function: $\mathcal{L}(\theta)=\log L(\theta)$
  2. Maximise the log-likelihood function $\mathcal{L}(\theta)$ with respect to $\theta$ to obtain $\hat{\theta}$
  3. Verify that the obtained $\hat{\theta}$ is indeed a maximum and lies within the correct range for $\theta$.

Let's see how this MLE procedure for estimating model parameters from sample data applies to another distribution from Table 1: The Poisson distribution.

Example: Maximum Likelihood Estimate for a Poisson Random Variable

Given a dataset $D=\left(D_{1}, D_{2}, \ldots, D_{n}\right)$ of i.i.d. samples drawn from the Poisson distribution, derive the expression for the Maximum-Likelihood Estimate of $\lambda$.

Solution:

We approach this problem using the three-step procedure described above. The first step is to write down the likelihood function. By the definition of a Poisson random variable, each of the samples in the dataset has the probability:

$$f_{X}\left(D_{i} \mid \lambda\right)=\frac{\lambda^{D_{i}}}{D_{i}!} e^{-\lambda}$$

The likelihood for the entire dataset $D$ is therefore:

$$L(\lambda)=\prod_{i=1}^{n} f_{X}\left(D_{i} \mid \lambda\right)=\prod_{i=1}^{n} \frac{\lambda^{D_{i}}}{D_{i}!} e^{-\lambda}$$

We can factor and rearrange the likelihood above as follows:

$$\begin{aligned} L(\lambda) &= \prod_{i=1}^{n} \frac{1}{D_{i}!} \prod_{i=1}^{n} \lambda^{D_{i}} e^{-\lambda} \\ &= \left(\lambda^{\sum_{i=1}^{n} D_{i}}\right)\left(e^{-n \lambda}\right) \prod_{i=1}^{n} \frac{1}{D_{i}!} \end{aligned}$$

Taking the natural logarithm of the above, we obtain the following expression for the log-likelihood:

$$\mathcal{L}(\lambda)=\ln L(\lambda)=\ln (\lambda)\left[\sum_{i=1}^{n} D_{i}\right]-n \lambda+\ln \left[\prod_{i=1}^{n} \frac{1}{D_{i}!}\right]$$

We can now directly maximise $\mathcal{L}$, for $\lambda \geq 0$:

$$\begin{aligned} \frac{d \mathcal{L}}{d \lambda} &= 0 \\ \frac{\sum_{i=1}^{n} D_{i}}{\lambda}-n &= 0 \quad (\text{assuming } \lambda \neq 0) \end{aligned}$$

$$\lambda=\hat{\theta}=\frac{\sum_{i=1}^{n} D_{i}}{n}$$

Therefore our MLE of $\lambda$ is the sample mean once again!

We can verify that this is indeed the maximum by taking the second derivative of $\mathcal{L}$ and checking that it is negative at $\lambda=\hat{\theta}$:

$$\mathcal{L}^{\prime \prime}(\lambda=\hat{\theta})=-\frac{\sum_{i=1}^{n} D_{i}}{\lambda^{2}}$$

This is negative for any $\lambda>0$, provided the sample contains at least one nonzero count.
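
As a quick sanity check (my own sketch; the true rate and sample size below are arbitrary assumptions), the sample mean does coincide with the value a numerical optimiser finds for the Poisson log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Compare the closed-form Poisson MLE (the sample mean) with a numerical maximiser.
rng = np.random.default_rng(3)
true_lambda = 4.5                      # arbitrary value chosen for the demo
D = rng.poisson(true_lambda, size=1000)

closed_form = D.mean()

# Minimise the negative log-likelihood over a bounded range of lambda.
neg_log_likelihood = lambda lam: -np.sum(poisson.logpmf(D, lam))
numerical = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded").x

print(f"sample mean  : {closed_form:.4f}")
print(f"numerical MLE: {numerical:.4f}")
```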

Remarks on MLE

Maximum-likelihood estimation is a standard illustration of frequentist statistics. While the estimated parameter approaches the correct population parameter as the sample size $n$ approaches infinity, the estimates might deviate from the actual population parameter value for smaller sample sizes. Nonetheless, an important advantage of maximum-likelihood estimation is that it produces consistent estimators (see Section 5.2.1; proof omitted here).

For instance, we estimated $\hat{\theta}$ to be equal to 0.4 for the $n=20$ coin-flip dataset used in this chapter, even though we used a fair coin to generate it. We were equally likely to have obtained the value 0.6. In order to use statistics correctly, we need to be aware that parameter estimates are themselves stochastic, and be prepared to deal with the underlying uncertainty of these estimates. We therefore also need to consider interval estimation and statistical hypothesis testing as potential ways of dealing with this inherent randomness.
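
To make this stochasticity concrete, here is a small hedged simulation (fair coin, $n=20$; the number of repetitions is an arbitrary choice) showing how much the MLE varies from one dataset to the next:

```python
import numpy as np

# Repeatedly generate n = 20 flips of a fair coin and record the MLE each time,
# illustrating that the estimate theta_hat is itself a random variable.
rng = np.random.default_rng(4)
n, n_repeats = 20, 10_000
estimates = rng.binomial(1, 0.5, size=(n_repeats, n)).mean(axis=1)

print("mean of theta_hat over repeats:", estimates.mean())   # close to 0.5
print("std of theta_hat over repeats :", estimates.std())    # roughly sqrt(0.25 / 20) ~ 0.11
print("fraction of repeats with theta_hat <= 0.4:", (estimates <= 0.4).mean())
```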