Logistic Regression

Logistic regression is a statistical method for predicting binary classes. The output is a binary outcome: a decision between two alternatives, such as yes or no, win or lose, cat or no cat. Unlike linear regression, which predicts continuous values, logistic regression is used for predicting discrete outcomes.

Processing Images into Feature Vectors

In image processing, we convert image data from matrices into a feature vector: a flattened, one-dimensional array. This feature vector, whose dimension is denoted $n_x$, is what the model uses for training and making predictions.
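
As a rough sketch, an RGB image stored as a (height, width, 3) NumPy array can be flattened into a single column vector; the 64×64 size below is just an illustrative assumption:

```python
import numpy as np

# Hypothetical 64x64 RGB image: three colour-channel matrices stacked
# into an array of shape (64, 64, 3).
image = np.random.randint(0, 256, size=(64, 64, 3))

# Flatten into a single feature vector of shape (n_x, 1),
# where n_x = 64 * 64 * 3 = 12288.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1)
```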

Notation

  • $m$: Number of training examples.
  • $n_x$: Size of the input feature vector.
  • $n_y$: Size of the output vector (usually 1 for binary classification, where each label is 0 or 1).
  • $\mathbf{x}^{(1)}$ and $\mathbf{y}^{(1)}$: The first input and output vectors in the dataset.

The input data $\mathbf{X}$ and output data $\mathbf{Y}$ are represented as:

$$\mathbf{X} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(m)}], \quad \mathbf{Y} = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(m)}]$$

This can be visualised as:

$$\mathbf{X} = \begin{bmatrix} \vert & \vert & & \vert \\ \mathbf{x}^{(1)} & \mathbf{x}^{(2)} & \cdots & \mathbf{x}^{(m)} \\ \vert & \vert & & \vert \end{bmatrix}_{n_x \times m}$$

Using NumPy for Efficient Operations

Python's NumPy library is commonly used for handling these data structures efficiently. It allows for quick vectorized operations on large matrices, a common requirement in deep learning.
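
As a rough illustration of why vectorization matters (the array sizes here are arbitrary), the same dot product computed with an explicit Python loop and with a single NumPy call:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Explicit loop: one multiplication and addition per element.
loop_result = 0.0
for i in range(len(a)):
    loop_result += a[i] * b[i]

# Vectorized: the same computation in one optimized call.
vectorized_result = np.dot(a, b)

print(np.isclose(loop_result, vectorized_result))  # True
```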

Each training example is a pair $(\mathbf{x}, \mathbf{y})$, where $\mathbf{x}$ is an $n_x$-dimensional vector (input) and $\mathbf{y}$ is the output. To check the dimensions in Python, you can use X.shape, which for the full input matrix returns $(n_x, m)$. The output matrix $\mathbf{Y}$ is usually a $(1, m)$ row vector, holding the binary output for each training example.
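
For example, under the shape conventions above (the dataset here is randomly generated, purely for illustration):

```python
import numpy as np

# Hypothetical dataset: m = 100 training examples, each a flattened
# image with n_x = 12288 features.
m, n_x = 100, 12288

# Stack the m column vectors side by side so X has shape (n_x, m).
X = np.random.rand(n_x, m)

# Binary labels stored as a (1, m) row vector.
Y = np.random.randint(0, 2, size=(1, m))

print(X.shape)  # (12288, 100)
print(Y.shape)  # (1, 100)
```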

The Basics of Logistic Regression

In linear regression, we might use a simple equation like $y = wx + b$, but this doesn't work well for binary classification. Instead, given a feature vector $\mathbf{x}$, logistic regression predicts a binary outcome $\hat{y}$, starting from the linear equation:

$$\hat{y} = \mathbf{w}^T\mathbf{x} + b$$

Here, $\mathbf{w}$ is an $n_x$-dimensional vector of weights, $b$ is a real number (the bias term), and $\mathbf{x}$ is our input vector. The goal is to learn the best values for $\mathbf{w}$ and $b$ so that $\hat{y}$ is as close as possible to the actual outcome $y$. Essentially, given $\{ (\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(m)}, \mathbf{y}^{(m)}) \}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.
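
A tiny sketch of this linear step (the numbers below are arbitrary):

```python
import numpy as np

# Illustrative single example with n_x = 3 features.
x = np.array([[0.5], [1.2], [-0.3]])   # (n_x, 1) input vector
w = np.array([[0.1], [0.4], [0.2]])    # (n_x, 1) weights (arbitrary values)
b = 0.5                                # bias

# w^T x + b produces a single real number, not yet a probability.
z = np.dot(w.T, x) + b
print(z)  # [[0.97]]
```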

Sigmoid Function for Probabilities

To ensure the output $\hat{y}$ is between 0 and 1, we apply the sigmoid function:

$$\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b), \quad \sigma(z) = \frac{1}{1+e^{-z}}$$

(Figure: the sigmoid function.)

The sigmoid function squashes the output of the linear equation into a range between 0 and 1, making it possible to interpret the result as a probability.

ℹ️ Note that if $z$ is a large positive number, then:

$$\sigma(z) \approx \frac{1}{1 + 0} = 1$$

and if $z$ is a large negative number, then:

$$\sigma(z) \approx \frac{1}{1 + \text{big number}} \approx 0$$
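
A minimal NumPy sketch of the sigmoid and its behaviour at the extremes:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Large positive z pushes the output towards 1,
# large negative z pushes it towards 0.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.53978687e-05 5.00000000e-01 9.99954602e-01]
```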

Cost Function in Logistic Regression

To train the parameters $\mathbf{w}$ and $b$, it is critical to choose the right cost function. The first loss function you might reach for is the mean squared error:

$$\mathcal{L}(\hat{y}, y) = \frac{1}{2} (\hat{y} - y)^2$$

Unlike in linear regression, we don't use the mean squared error here because, combined with the sigmoid, it produces a non-convex cost with multiple local minima. Instead, we use the binary cross-entropy loss function:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big)$$

This loss function is convex, so it has just one minimum, making it easier to optimize. To explain it, consider the two cases (a short numerical check follows this list):

  • For an actual label $y$ of 1:

    • The loss simplifies to $\mathcal{L}(\hat{y}, 1) = -\log(\hat{y})$
    • To minimise the loss, we want the predicted probability $\hat{y}$ to be as large as possible
    • The largest value $\hat{y}$ can take is 1
  • For an actual label $y$ of 0:

    • The loss simplifies to $\mathcal{L}(\hat{y}, 0) = -\log(1 - \hat{y})$
    • We want $1 - \hat{y}$ to be as large as possible
    • So we want $\hat{y}$ to be as small as possible, approaching 0
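
As a quick numerical check of these two cases (the values of $\hat{y}$ below are just illustrative):

```python
import numpy as np

def loss(y_hat, y):
    """Binary cross-entropy loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# When y = 1, the loss shrinks as y_hat approaches 1.
print(loss(0.9, 1))  # ~0.105
print(loss(0.1, 1))  # ~2.303

# When y = 0, the loss shrinks as y_hat approaches 0.
print(loss(0.1, 0))  # ~0.105
print(loss(0.9, 0))  # ~2.303
```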

To measure the performance across the entire training set, we use the cost function:

$$\begin{aligned} J(w, b) &= \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right) \end{aligned}$$

The cost function $J$ is the average of the loss $\mathcal{L}$ over all $m$ training examples, where $\mathcal{L}$ is the binary cross-entropy loss.
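
A vectorized sketch of this cost computation, following the shape conventions above (the data is randomly generated purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(w, b, X, Y):
    """Average binary cross-entropy over all m examples.

    w: (n_x, 1) weights, b: scalar bias,
    X: (n_x, m) inputs, Y: (1, m) binary labels.
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)  # (1, m) predictions
    losses = Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)
    return -np.sum(losses) / m

# Illustrative usage with random data.
n_x, m = 4, 10
X = np.random.rand(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))
w = np.zeros((n_x, 1))
b = 0.0
print(compute_cost(w, b, X, Y))  # ~0.693 with zero-initialised parameters
```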

Gradient Descent

Gradient descent is a method used in machine learning to find the best parameters for our model, in this case $\mathbf{w}$ and $b$, by minimizing the cost function. We start by setting $\mathbf{w}$ and $b$ to zero; because the cost function is convex, zero initialisation works fine in logistic regression and gives us a neutral starting decision boundary.

With the learning rate $\alpha$, which controls how big a step we take when updating the parameters, we iteratively adjust $\mathbf{w}$ and $b$ using the following update rules:

For $w$:

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$

And for $b$:

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

We normally denote this more simply as:

$$w := w - \alpha\,(\partial w), \quad b := b - \alpha\,(\partial b)$$

Here, the derivatives $\frac{\partial J(w, b)}{\partial w}$ and $\frac{\partial J(w, b)}{\partial b}$ give us the direction in which to change $w$ and $b$ to reduce the cost, meaning they tell us how to adjust our parameters to make our model's predictions more accurate.
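
Putting the pieces together, here is a minimal training-loop sketch. It assumes the standard gradients of the binary cross-entropy cost for logistic regression, $\partial w = \frac{1}{m}\mathbf{X}(\hat{Y} - Y)^T$ and $\partial b = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$, which are not derived here, and the hyperparameter values are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iterations=1000):
    """Gradient descent for logistic regression.

    X: (n_x, m) inputs, Y: (1, m) binary labels.
    Returns the learned weight vector w and bias b.
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))   # start from a neutral decision boundary
    b = 0.0

    for _ in range(num_iterations):
        Y_hat = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)

        # Gradients of the cost J with respect to w and b
        # (standard results for binary cross-entropy with a sigmoid).
        dw = np.dot(X, (Y_hat - Y).T) / m     # shape (n_x, 1)
        db = np.sum(Y_hat - Y) / m            # scalar

        # Update rules: step against the gradient, scaled by alpha.
        w = w - alpha * dw
        b = b - alpha * db

    return w, b
```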