
Common Continuous Distributions

We will now analyse three important probability distributions for continuous random variables, particularly those arising in engineering and statistics: the Uniform, Normal, and Exponential distributions.

Uniform Distribution

Let's start off with a really simple example: the uniform distribution. A uniform distribution just means there's a flat, constant probability of a value occurring anywhere within a given range. We can create a uniform distribution by using the NumPy random.uniform function.
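For instance, a minimal sketch (the range of -10 to 10 and the sample size here are just illustrative choices) might look like this:

import numpy as np
import matplotlib.pyplot as plt

# Draw 100,000 samples uniformly from the illustrative range [-10, 10)
values = np.random.uniform(-10.0, 10.0, 100000)

# The histogram of the samples is roughly flat across the range
plt.hist(values, 50)
plt.show()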

Uniform Distribution

There's pretty much an equal chance of any given value, or range of values, occurring within that data. So, unlike the normal distribution, where we see a concentration of values near the mean, a uniform distribution has equal probability across the whole range that you define.

So what would the probability density function of this look like? Well, I'd expect to see basically nothing outside the range of -10 to 10. But between -10 and 10, I would see a flat line, because there is a constant probability for any value in that range. So for a uniform distribution the probability density function is a flat line: every value, and every equal-width range of values, has the same chance of occurring as any other.
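To see this flat line directly, one option is to plot the PDF with scipy.stats.uniform (again using the illustrative range of -10 to 10):

from scipy.stats import uniform
import numpy as np
import matplotlib.pyplot as plt

# uniform(loc, scale) is supported on [loc, loc + scale]; here that is [-10, 10]
x = np.arange(-15, 15, 0.001)
plt.plot(x, uniform.pdf(x, loc=-10, scale=20))
plt.show()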

Normal (Gaussian) Distribution

To visualise this in Python:

from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

# Plot the standard Normal PDF over a fine grid of x values
x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x))
plt.show()

Which results in:

Standard Normal PDF plotted in Python

We continue our discussion of continuous distributions by considering perhaps the most important probability distribution of them all: the Normal distribution. We will begin with a general definition for the Normal distribution, which we shall see has two parameters \mu and \sigma^{2} that exactly define E[X] and \operatorname{Var}[X], respectively.

The random variable X follows a Normal distribution with parameters \mu and \sigma^{2} (where \sigma>0), denoted by

X \sim N\left(\mu, \sigma^{2}\right)

if its density function is given by

f_{X}(x \mid \mu, \sigma)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-(x-\mu)^{2} / 2 \sigma^{2}}

The parameters \mu and \sigma^{2} are the mean (E[X]) and variance (\operatorname{Var}[X]) of X, respectively.

At first glance this expression appears rather complex, but we will now introduce an important transformation that will greatly simplify our calculations.

Standardising and the Z-transformation

We start by considering the following important property of the Normal distribution:

If X \sim N\left(\mu, \sigma^{2}\right), then for any constants a and b (b \neq 0), the random variable Y=a+b X is also normally distributed.

Using our properties from Chapter 2, the expressions for the expectation and variance of Y are easily computed as:

\begin{aligned} E[Y] & =E[a+b X]=a+b E[X]=a+b \mu \\ \operatorname{Var}[Y] & =\operatorname{Var}[a+b X]=b^{2} \operatorname{Var}[X]=b^{2} \sigma^{2} \end{aligned}

From which we deduce that Y \sim N\left(a+b \mu, b^{2} \sigma^{2}\right) (a formal derivation of this is omitted to save space but can be found in almost any Sheldon Ross book on probability).

Recall that the expression Y=a+b X holds for any values of a and b \neq 0! What we would like to do is find the values for a and b such that E[Y]=0 and \operatorname{Var}[Y]=1; we shall refer to such a distribution, N(0,1), as the standard Normal. Solving for a and b we find:

\begin{aligned} & E[Y]=a+b \mu=0 \quad \Rightarrow \quad a=-b \mu=-\frac{\mu}{\sigma} \\ & \operatorname{Var}[Y]=b^{2} \sigma^{2}=1 \quad \Rightarrow \quad b=\frac{1}{\sigma} \end{aligned}

Plugging in a=-\mu / \sigma and b=1 / \sigma into our expression Y=a+b X, we find that Y=\frac{X-\mu}{\sigma}; we refer to this as the Z-transformation: any Normal random variable X \sim N\left(\mu, \sigma^{2}\right) can be equivalently expressed as Z \sim N(0,1) using the Z-transformation:

\begin{gathered} Z \equiv \frac{X-\mu}{\sigma} \\ Z \sim N(0,1): \quad E[Z]=0, \quad \operatorname{Var}[Z]=1 \end{gathered}

This simplifies the Normal PDF f_{X}(x \mid \mu, \sigma) to the standard Normal PDF:

f_{Z}(z)=\frac{1}{\sqrt{2 \pi}} e^{-z^{2} / 2}

Figure 17: Standard Normal PDF f_{Z}(z)=\frac{1}{\sqrt{2 \pi}} e^{-z^{2} / 2} with \operatorname{Var}[Z]=1 (and also a standard deviation of 1, since the standard deviation is \sigma=\sqrt{\operatorname{Var}[Z]}) and E[Z]=0. To compute probabilities from this PDF, we will use lookup tables to find the value of the CDF F_{Z}(z)=\int_{-\infty}^{z} f_{Z}(u) d u for a given value z, as illustrated for z=-2.

In other words, to standardise a Normal random variable, we subtract its mean and then divide by the square root of its variance (which, recall, we defined as the standard deviation). Notice that the parameters (\mu, \sigma) have been removed in the standard Normal PDF (i.e. f_{Z}(z) is not conditioned on any model parameters)! We shall see that this useful property allows us to map any Normal distribution onto the standard Normal, which we only have to compute once and store in a lookup table.
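As a quick numerical sanity check of this mapping (the values of \mu, \sigma and x below are arbitrary), the general Normal density at x should match the standard Normal density at z=(x-\mu)/\sigma scaled by 1/\sigma:

from scipy.stats import norm

mu, sigma = 3.0, 2.0          # illustrative parameters, not from the notes
x = 4.2
z = (x - mu) / sigma          # Z-transformation

# Change of variables: f_X(x | mu, sigma) = (1/sigma) * f_Z(z)
print(norm.pdf(x, loc=mu, scale=sigma))   # density of N(mu, sigma^2) at x
print(norm.pdf(z) / sigma)                # same value via the standard Normal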

Expectation and Variance:

To show that E[X]=\mu, we integrate using the Z-transformation z=(x-\mu) / \sigma, so that d x=\sigma d z. The result is:

\begin{aligned} E[X] & =\int_{-\infty}^{\infty} x f_{X}(x) d x=\int_{-\infty}^{\infty} x \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-(x-\mu)^{2} / 2 \sigma^{2}} d x=\int_{-\infty}^{\infty}(\sigma z+\mu) \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-z^{2} / 2} \sigma d z \\ & =\frac{\sigma}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} z e^{-z^{2} / 2} d z+\mu \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi}} e^{-z^{2} / 2} d z=\mu \end{aligned}

The first integral vanishes because z e^{-z^{2} / 2} is an odd function, and the final integral evaluates to 1 by definition because we are integrating the standard Normal PDF f_{Z}(z) over its entire range.

Next we can evaluate E\left[X^{2}\right] to show that \operatorname{Var}[X]=\sigma^{2}, which requires integration by parts:

\int u d v=u v-\int v d u

We leave it to the reader to prove this.
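Rather than carrying out the algebra here, a quick numerical check of this result (with arbitrary illustrative parameters) can be done in scipy:

from scipy.stats import norm
from scipy.integrate import quad
import numpy as np

mu, sigma = 1.5, 2.0   # illustrative parameters

# E[X^2] by numerical integration of x^2 * f_X(x)
ex2, _ = quad(lambda x: x**2 * norm.pdf(x, mu, sigma), -np.inf, np.inf)

print(ex2 - mu**2)     # Var[X] = E[X^2] - (E[X])^2, close to sigma**2 = 4.0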

Computing Probabilities using the Normal CDF

The CDF of the Normal has no explicit closed form. However, as we have just noted above, we can compute probabilities involving Normally distributed random variables by standardising them (i.e. using the Z-transformation). That is:

F_{X}(x)=P(X \leq x)=P\left(\frac{X-\mu}{\sigma} \leq \frac{x-\mu}{\sigma}\right)=P\left(Z \leq \frac{x-\mu}{\sigma}\right)=F_{Z}\left(\frac{x-\mu}{\sigma}\right)

Where the standard Normal CDF F_{Z}(z)=P(Z \leq z) is defined as:

F_{Z}(z)=\int_{-\infty}^{z} f_{Z}(u) d u=\int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} e^{-u^{2} / 2} d u

Thus, using a precomputed table of values of the standard Normal CDF F_{Z}(z) (sometimes also denoted as \Phi(z) in other textbooks), we can obtain the CDF for the general Normal of interest. Indeed, every probability textbook presents these tables in its appendices (a table is currently posted on BBl).

Lastly, this procedure can be generalised for computing the probability over an interval [a, b] for any X \sim N\left(\mu, \sigma^{2}\right):

\begin{aligned} P(a \leq X \leq b) & =P\left(\frac{a-\mu}{\sigma} \leq \frac{X-\mu}{\sigma} \leq \frac{b-\mu}{\sigma}\right) \\ & =P\left(\frac{a-\mu}{\sigma} \leq Z \leq \frac{b-\mu}{\sigma}\right) \\ & =P\left(Z \leq \frac{b-\mu}{\sigma}\right)-P\left(Z \leq \frac{a-\mu}{\sigma}\right) \\ & =F_{Z}\left(\frac{b-\mu}{\sigma}\right)-F_{Z}\left(\frac{a-\mu}{\sigma}\right) \end{aligned}

Where again the CDF F_{Z}(z) can be obtained as the area under the curve of the PDF f_{Z}(z) from -\infty to z, as illustrated in Figure 17 for z=-2.
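In place of a printed table, scipy's norm.cdf can play the role of F_{Z}; a small sketch with illustrative values of \mu, \sigma, a and b:

from scipy.stats import norm

mu, sigma = 10.0, 2.0     # illustrative parameters
a, b = 9.0, 13.0

# P(a <= X <= b) = F_Z((b - mu)/sigma) - F_Z((a - mu)/sigma)
p = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
print(p)                                    # ~0.6247

# Equivalent direct computation without standardising
print(norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))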

Approximation of the Binomial Distribution

You might have already noticed by now that, under certain conditions, some of the previous distributions we have considered have a symmetric, bell shape to them (e.g. see Figures 13 and 15 for the binomial and Poisson, respectively). Indeed, the standard Normal (N(0,1)) was proposed in 1733 by De Moivre to approximate the binomial distribution for p=0.5 (and was later generalised for any p by Laplace in 1812). Recall that for the binomial distribution:

\begin{aligned} E[X] & =n p \\ \operatorname{Var}[X] & =n p(1-p) \end{aligned}

Thus, we can standardise X as:

Z=\frac{X-n p}{\sqrt{n p(1-p)}}

We can approximate the distribution of a binomial random variable X, where p is the probability of "success", for large n as:

\lim _{n \rightarrow \infty} P\left[a \leq \frac{X-n p}{\sqrt{n p(1-p)}} \leq b\right]=\frac{1}{\sqrt{2 \pi}} \int_{a}^{b} e^{-z^{2} / 2} d z

For any numbers a and b. Note that this approximation is even a bit more "flexible" than the Poisson approximation of the binomial distribution, in that it can provide a decent approximation for relatively small n (we will consider an example shortly for p=0.5 and n=20). It should be noted that we can improve the accuracy of the approximation of the discrete distribution by making the following continuity correction, which widens the interval by half a unit at each end on the original counting scale before standardising:

P[a \leq X \leq b] \approx \frac{1}{\sqrt{2 \pi}} \int_{(a-0.5-n p) / \sqrt{n p(1-p)}}^{(b+0.5-n p) / \sqrt{n p(1-p)}} e^{-z^{2} / 2} d z

This will become clear with the consideration of an example.

Example: Normal Approximation to Binomial Distribution of Ebola Virus

Recall our example of modelling the number of survivors of the Ebola Virus as X \sim \operatorname{Binomial}(n=15, p=0.5). If 15 people are infected, we had computed the probability of ten or more survivors to be P(X \geq 10)=0.150879. Let's consider a Normal approximation to this distribution.

Solution:

First we need to standardise the values for X based on E[X]=n p=7.5 and \operatorname{Var}[X]=n p(1-p)=3.75:

X^{\prime}=\frac{10-7.5}{\sqrt{3.75}}=1.29099

Now we integrate the standard Normal curve as follows:

P\left[1.29099 \leq \frac{X-7.5}{\sqrt{3.75}}\right]=\frac{1}{\sqrt{2 \pi}} \int_{1.29099}^{\infty} e^{-z^{2} / 2} d z

Figure 18: Normal approximation to the binomial distribution for the number of survivors of the Ebola Virus. To evaluate the probability of 10 or more survivors using the Normal approximation, we can integrate the Normal curve for X \geq 10 (red line), or, using the continuity correction, for X \geq 9.5 (blue line), which turns out to be a much better approximation of the binomial probability.

Numerical evaluation or table lookup of this integral gives P(X \geq 10)=0.0983529, which is a substantial underestimate of the binomial probability. However, if we apply the continuity correction:

\begin{gathered} X^{\prime}=\frac{9.5-7.5}{\sqrt{3.75}}=1.0328 \\ P\left[1.0328 \leq \frac{X-7.5}{\sqrt{3.75}}\right]=\frac{1}{\sqrt{2 \pi}} \int_{1.0328}^{\infty} e^{-z^{2} / 2} d z \end{gathered}

we get a much more accurate estimate: P(X \geq 9.5)=0.15085. The quality of the fit of the Normal approximation to the binomial distribution is shown in Figure 18.
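These numbers are straightforward to reproduce with scipy (a small sketch; binom.sf(9, ...) gives the exact tail probability P(X \geq 10)):

from scipy.stats import binom, norm
import numpy as np

n, p = 15, 0.5
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

print(binom.sf(9, n, p))              # exact P(X >= 10)                      ~= 0.1509
print(norm.sf((10 - mu) / sigma))     # Normal approximation, no correction   ~= 0.0984
print(norm.sf((9.5 - mu) / sigma))    # with continuity correction            ~= 0.1508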

Example: Normal Approximation of Binomial: ESP Research

This next example of the Normal approximation to the Binomial distribution follows from a more serious treatment by William Feller (Feller, W., "Statistical Aspects of ESP.", Journal of Parapsychology, 4: 271-298, 1940).

In 1938, to test whether extrasensory perception (ESP) exists, researchers Pratt and Woodruff at Duke University conducted a study using 32 students. For this purpose, the investigator and the student sit at opposite ends of a table. The investigator would shuffle a deck of cards containing 5 'ESP symbols' (see Figure 19), select a card and concentrate on it. The student would then try to guess the identity of the card (i.e. guess which of the 5 symbols the investigator was concentrating on). For a comical rendition of this experiment, see the opening scene of the film Ghostbusters (1984).

In the Pratt and Woodruff experiment, the 32 students made a total of 60,000 guesses, which were found to be correct 12,489 times (i.e. x=12,489). Since there were only 5 ESP symbols, we could estimate p=\frac{1}{5}, and assuming this to be a series of independent Bernoulli trials (i.e. a binomial distribution), our expected number of successful guesses would be E[X]=n p=60,000 \cdot \frac{1}{5}=12,000. Can we conclude from this study that the additional 489 correct guesses prove that ESP exists?

Solution:

We can compute the probability of observing 12,489 or more correct guesses using the binomial distribution as follows:

P(X \geq 12,489)=\sum_{x=12,489}^{60,000}\left(\begin{array}{c} 60,000 \\ x \end{array}\right)\left(\frac{1}{5}\right)^{x}\left(\frac{4}{5}\right)^{60,000-x}

Clearly the binomial coefficient in this problem cannot be computed easily! Thus, we shall make use of the Normal approximation to the binomial distribution, which should be very accurate in this case given that n=60,000. Using the continuity correction, we have:

\begin{gathered} X^{\prime}=\frac{12,488.5-12,000}{\sqrt{12,000 \cdot 4 / 5}}=4.99 \\ P\left[4.99 \leq \frac{X-12,000}{\sqrt{12,000 \cdot 4 / 5}}\right]=\frac{1}{\sqrt{2 \pi}} \int_{4.99}^{\infty} e^{-z^{2} / 2} d z=0.0000003=3.0 \times 10^{-7} \end{gathered}

Interestingly, the probability of observing this many guesses (or more) to be correct is so low that it suggests this could not have been due to chance! However, it should be noted that this particular ESP experiment has been met with a great deal of skepticism in the scientific community.
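As a sketch, the exact binomial tail probability is also feasible numerically with scipy (even though the hand calculation is not), and it agrees with the Normal approximation to within the same order of magnitude:

from scipy.stats import binom, norm
import numpy as np

n, p = 60000, 1 / 5
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

print(binom.sf(12488, n, p))              # exact P(X >= 12489)
print(norm.sf((12488.5 - mu) / sigma))    # Normal approximation with continuity correction, ~3e-7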

Figure 19: ESP cards (also known as 'Zener Cards') used in the ESP experiments.

Comments: Even with the continuity correction, the Normal approximation of the binomial can be poor if n is too small. General conditions for applying the approximation are n>9 \frac{p}{1-p} and n>9 \frac{1-p}{p}.

Central Limit Theorem

Interestingly, every binomial random variable X can be expressed as the sum of n independent Bernoulli random variables X_{1}, X_{2}, \ldots, X_{n}, where X_{i}=1 with probability p and X_{i}=0 with probability 1-p (think for a moment why this is so; we actually used this property to derive the expectation of a binomial random variable given the result for the Bernoulli distribution). That is:

X=\sum_{i=1}^{n} X_{i}

Therefore the Normal approximation to the binomial distribution can also be written as:

\lim _{n \rightarrow \infty} P\left[a \leq \frac{\left(X_{1}+X_{2}+\ldots+X_{n}\right)-n p}{\sqrt{n p(1-p)}} \leq b\right]=\frac{1}{\sqrt{2 \pi}} \int_{a}^{b} e^{-z^{2} / 2} d z

This raises the question: does this limit apply to the sums of other types of random variables? Astonishingly, it does! Indeed, significant efforts were made over the next two centuries to extend this result to other distributions. In 1920, this was generalised by G. Pólya as the Central Limit Theorem:

Suppose that the random variables Y_{1}, Y_{2}, \ldots, Y_{n} are independent and from the same distribution, with mean E[Y]=\mu and variance \operatorname{Var}[Y]=\sigma^{2}<\infty (i.e. E[Y] and \operatorname{Var}[Y] are finite). We shall refer to this collection of variables as a random sample, but defer further discussion until the Statistics portion of the course. The Central Limit Theorem (CLT) states, for any numbers a and b, that:

\begin{aligned} \lim _{n \rightarrow \infty} P\left[a \leq \frac{\left(Y_{1}+Y_{2}+\ldots+Y_{n}\right)-E\left[\sum_{i=1}^{n} Y_{i}\right]}{\sqrt{\operatorname{Var}\left[\sum_{i=1}^{n} Y_{i}\right]}} \leq b\right] & =\lim _{n \rightarrow \infty} P\left[a \leq \frac{\left(Y_{1}+Y_{2}+\ldots+Y_{n}\right)-n \mu}{\sqrt{n} \sigma} \leq b\right] \\ & =\frac{1}{\sqrt{2 \pi}} \int_{a}^{b} e^{-z^{2} / 2} d z \end{aligned}

As the sample size increases, the distribution of the sum converges to the Normal, regardless of the original distribution of the Y_{i}.

In the above derivation, we have made use of the simple fact that:

\begin{aligned} E\left[\sum_{i=1}^{n} Y_{i}\right] & =\sum_{i=1}^{n} \mu=n \mu \\ \operatorname{Var}\left[\sum_{i=1}^{n} Y_{i}\right] & =\sum_{i=1}^{n} \sigma^{2}=n \sigma^{2} \end{aligned}

You will commonly encounter the CLT presented in terms of the average of Y_{1}, Y_{2}, \ldots, Y_{n} (denoted as \bar{Y}) rather than the sum. Again, we can simply compute the expectation and variance of the average as:

\begin{aligned} E[\bar{Y}] & =E\left[\frac{1}{n} \sum_{i=1}^{n} Y_{i}\right]=\frac{1}{n}\left(\sum_{i=1}^{n} \mu\right)=\frac{1}{n} n \mu=\mu \\ \operatorname{Var}[\bar{Y}] & =\operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} Y_{i}\right]=\frac{1}{n^{2}}\left(\sum_{i=1}^{n} \sigma^{2}\right)=\frac{1}{n^{2}} n \sigma^{2}=\frac{\sigma^{2}}{n} \end{aligned}

Plugging in these expressions for E[\bar{Y}] and \operatorname{Var}[\bar{Y}] results in the following form of the Central Limit Theorem for the average \bar{Y}:

\lim _{n \rightarrow \infty} P\left[a \leq \frac{\bar{Y}-\mu}{\sigma / \sqrt{n}} \leq b\right]=\frac{1}{\sqrt{2 \pi}} \int_{a}^{b} e^{-z^{2} / 2} d z

The above representation of the Central Limit Theorem is extremely important and will be used throughout the Statistics section of the course. It is equivalent to stating that, as the sample size n becomes large, the average of any random sample of random variables \left(Y_{1}, Y_{2}, \ldots, Y_{n}\right) becomes Normally distributed:

\bar{Y} \rightarrow N\left(\mu, \sigma^{2} / n\right)

Comments: Although the Central Limit Theorem has been specified in the limit of large n, it is surprisingly accurate even for small values of n. Generally speaking, if the sample comes from a symmetric PDF then the approximation will converge quickly. Having said this, recall the shape of the exponential PDF, f_{W}(w \mid \hat{\lambda})=\hat{\lambda} e^{-\hat{\lambda} w}, as presented in Figure 16, which is clearly not symmetric. Nonetheless, the CLT still applies to this distribution, albeit for much greater values of n (we will see some examples of this in lecture).
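A short simulation sketch makes this convergence visible (the rate, sample size, and number of repetitions below are illustrative): we average n exponential waiting times many times over and overlay the Normal density that the CLT predicts:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
lam, n, reps = 0.1, 50, 10000      # illustrative rate, sample size and number of repetitions

# Each row is a sample of n exponential waiting times (mean 1/lam, variance 1/lam^2)
means = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)

# Compare the histogram of sample means with the CLT prediction N(mu, sigma^2 / n)
plt.hist(means, bins=50, density=True)
x = np.linspace(means.min(), means.max(), 300)
plt.plot(x, norm.pdf(x, loc=1 / lam, scale=(1 / lam) / np.sqrt(n)))
plt.show()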

Exponential Distribution

An important distribution that has applications in modelling aspects of process safety (discussed in the 'Reliability' section of this course) is the Exponential distribution, which has a close relationship with the discrete Poisson distribution discussed earlier. In Python, we can use scipy.stats.expon to work with an exponential distribution.

from scipy.stats import expon
import matplotlib.pyplot as plt
import numpy as np

# Plot the exponential PDF (with the default rate of 1) over a grid of x values
x = np.arange(0, 10, 0.001)
plt.plot(x, expon.pdf(x))
plt.show()

Which will result in:

Exponential PDF plotted in Python

Let us consider again the Poisson Process for modelling bus arrivals (see Figure 14). While the Poisson distribution is used to compute the probability of a specific number of arrivals during a particular period T, we might also be interested in computing the "waiting time" between each bus, as shown by t_{1} and t_{k} in Figure 14. It is intuitive that the observations of waiting times will take on a range of continuous values, and thus we will need to model them using a continuous random variable and probability density function (PDF). However, we shall see that the resulting PDF utilises the same parameter, \hat{\lambda}, as the Poisson Process!

Suppose that we are interested in events that occur independently over time at a known average rate. Let \hat{\lambda} be the rate per unit time; then the waiting time between occurrences, W, follows an exponential distribution. We denote this by

W \sim \operatorname{Exponential}(\hat{\lambda})

In other words: If the number of arrivals in a time interval is Poisson distributed, then the time between arrivals is exponentially distributed.

Derivation of Exponential PDF from Poisson PMF:

To derive the exponential PDF in relation to the Poisson PMF, let's say we are interested in computing the probability that our waiting time is greater than some time T. We will refer to the continuous random variable for the waiting time as W, so that we do not confuse it with the discrete random variable X in the Poisson PMF. Thus, we seek the probability P(W>T). The waiting time satisfies W>T if and only if x=0 in our Poisson PMF f_{X}(x \mid \hat{\lambda}, T) for a given parameter T (i.e. no buses have arrived in the interval T), which leads to:

P(W>T)=f_{X}(x=0 \mid \hat{\lambda}, T)=\frac{e^{-\hat{\lambda} T}(\hat{\lambda} T)^{0}}{0 !}=e^{-\hat{\lambda} T}

We can also generically state for any random variable W that:

\begin{gathered} P(W \leq T)+P(W>T)=1 \\ F_{W}(T)=P(W \leq T) \end{gathered}

for some CDF F_{W}(w=T). If we combine the previous three equations, we can explicitly solve for F_{W}(w=T) as:

F_{W}(T)=1-P(W>T)=1-e^{-\hat{\lambda} T}

What this tells us is that F_{W}(w)=1-e^{-\hat{\lambda} w}. Recalling the relationship between a PDF and CDF \left(f_{W}(w)=\frac{d}{d w} F_{W}(w)\right), we can now define the PDF of the exponential distribution, where W is our continuous random variable for waiting times:

f_{W}(w \mid \hat{\lambda})= \begin{cases}\hat{\lambda} e^{-\hat{\lambda} w} & \text { if } w>0 \\ 0 & \text { otherwise }\end{cases}

Analogously, we could have worked from a definition of f_{W}(w) to derive the CDF. For a value u>0,

P(W \leq u)=\int_{-\infty}^{u} f_{W}(w \mid \hat{\lambda}) d w=\int_{0}^{u} \hat{\lambda} e^{-\hat{\lambda} w} d w=\left[-e^{-\hat{\lambda} w}\right]_{0}^{u}=1-e^{-\hat{\lambda} u}

so the complete form of the CDF is

F_{W}(w \mid \hat{\lambda})= \begin{cases}1-e^{-\hat{\lambda} w} & \text { if } w>0 \\ 0 & \text { otherwise }\end{cases}

One of the greatest uses of the exponential CDF, as we have just highlighted, is in computing the probability that the waiting time will exceed some value T:

P(W>T)=1-F_{W}(w=T \mid \hat{\lambda})=e^{-\hat{\lambda} T}

Let's highlight some interesting properties of the exponential PDF f_{W}(w \mid \hat{\lambda}) using the example of London buses.

Example: Exponential Distribution of Waiting Times for London Buses

Recall the '49 Bus', which is scheduled to arrive every ten minutes \left(\hat{\lambda}=\frac{1}{10}\right), and let's see how we can compute some queries relating to the expected waiting time W.

(a) What is the probability that you will have to wait for more than five minutes for a bus to arrive?

Solution: We recognise the query P(W>T), which we found in the derivation above to be the Poisson distribution evaluated at f_{X}(x=0 \mid \hat{\lambda}, T) (i.e. no arrivals in the interval T) and also 1-F_{W}(w=T \mid \hat{\lambda}):

\begin{aligned} P(W>T) & =1-F_{W}(w=T \mid \hat{\lambda}) \\ & =e^{-\hat{\lambda} T}=e^{-0.1 \times 5}=0.606531 \end{aligned}
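We can reproduce this with scipy (note that scipy.stats.expon is parameterised by scale = 1/\hat{\lambda}), and also confirm that it matches the Poisson probability of zero arrivals used in the derivation above:

from scipy.stats import expon, poisson

lam, T = 0.1, 5.0

# P(W > T) from the exponential CDF (expon uses scale = 1/lambda)
print(expon.sf(T, scale=1 / lam))   # 0.606531

# Same probability from the Poisson PMF: no arrivals in the interval T
print(poisson.pmf(0, lam * T))      # 0.606531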

(b) What is the probability that you will have to wait an additional ten minutes, given that you have already waited for five minutes and no bus has arrived?

Solution:

Using the definition of our waiting time, we seek to compute the conditional query P(W>15 \mid W>5). As this is a conditional probability, it can be expressed as:

P\left(W>T_{2} \mid W>T_{1}\right)=\frac{P\left(W>T_{2}, W>T_{1}\right)}{P\left(W>T_{1}\right)}=\frac{P\left(W>T_{2}\right)}{P\left(W>T_{1}\right)} \quad \left(\text{ for any } T_{2} \geq T_{1}\right)

Inserting the probabilities in the above expression:

\begin{aligned} P\left(W>T_{2} \mid W>T_{1}\right) & =\frac{P\left(W>T_{2}\right)}{P\left(W>T_{1}\right)} \\ & =\frac{e^{-\hat{\lambda} T_{2}}}{e^{-\hat{\lambda} T_{1}}}=e^{-\hat{\lambda}\left(T_{2}-T_{1}\right)} \end{aligned}

Thus P\left(W>T_{2} \mid W>T_{1}\right)=e^{-\hat{\lambda}\left(T_{2}-T_{1}\right)}, which is equivalent to P\left(W>\left(T_{2}-T_{1}\right)\right):

P\left(W>T_{2} \mid W>T_{1}\right)=P\left(W>\left(T_{2}-T_{1}\right)\right)

In our example, this means that the probability of waiting ten more minutes for the bus, after already having waited five minutes, is the same as the probability of waiting ten minutes in the first place! Figure 16 illustrates how P(W>10) is computed from the PDF f_{W}(w \mid \hat{\lambda}):

P(W>15 \mid W>5)=P(W>10)=0.367879
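A quick numerical check of this equality:

from scipy.stats import expon

scale = 10.0   # 1/lambda for lambda = 0.1

# P(W > 15 | W > 5) = P(W > 15) / P(W > 5), which equals P(W > 10)
print(expon.sf(15, scale=scale) / expon.sf(5, scale=scale))   # 0.367879
print(expon.sf(10, scale=scale))                              # 0.367879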

Such processes, which are independent of the past, are called memoryless processes. Indeed, the exponential distribution is the only continuous distribution that has this memoryless property.

Let's take one last look at the PDF of waiting times between occurrences, f_{W}(w \mid \hat{\lambda}), for the Poisson model of bus arrivals, as shown in Figure 16. It is interesting to note that the exponential distribution implies that the most common separations between Poisson events are the shortest ones. For example, in the bus arrival model, we do not expect the buses to arrive at equally spaced time intervals (as is advertised at the bus stops), but rather in clusters during certain time intervals. This phenomenon bears a strong resemblance to the phrase "things come in threes", where people often refer to good or bad things seemingly happening in clusters of three events.

Expectation and Variance:

Finding the expectation of W requires integration by parts:

\begin{aligned} E[W] & =\int_{-\infty}^{\infty} w f_{W}(w \mid \hat{\lambda}) d w=\int_{0}^{\infty} w \hat{\lambda} e^{-\hat{\lambda} w} d w=-\int_{0}^{\infty} w\left(e^{-\hat{\lambda} w}\right)^{\prime} d w \\ & =-\left[w e^{-\hat{\lambda} w}\right]_{0}^{\infty}+\int_{0}^{\infty}(w)^{\prime} e^{-\hat{\lambda} w} d w=-0+\int_{0}^{\infty} e^{-\hat{\lambda} w} d w \\ & =\left[-\frac{1}{\hat{\lambda}} e^{-\hat{\lambda} w}\right]_{0}^{\infty}=-\frac{1}{\hat{\lambda}}\left(0-e^{0}\right)=\frac{1}{\hat{\lambda}} \end{aligned}

To find E\left[W^{2}\right], and hence the variance of W, we need to integrate by parts twice. The end result is

\operatorname{Var}[W]=\frac{1}{\hat{\lambda}^{2}}
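Both results can be checked against scipy's built-in moments (using the same illustrative rate \hat{\lambda}=0.1 as in the bus example):

from scipy.stats import expon

lam = 0.1                          # illustrative rate

print(expon.mean(scale=1 / lam))   # 10.0  = 1/lambda
print(expon.var(scale=1 / lam))    # 100.0 = 1/lambda^2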

Figure 16: Exponential PDF f_{W}(w \mid \hat{\lambda}) for the times between bus arrivals, W, given \hat{\lambda}=0.1 arrivals/minute. Also shown is the query P(W>10)=1-F_{W}(w=10 \mid \hat{\lambda})=0.367879.