Logistic Regression

Logistic regression is a statistical method for predicting binary classes. The output is a binary outcome: a decision between two alternatives, such as yes or no, win or lose, cat or no cat. Unlike linear regression, which predicts continuous values, logistic regression is used for predicting discrete outcomes.

Processing Images into Feature Vectors

In image processing, we convert image data from matrices into a feature vector: a flattened, one-dimensional array. This feature vector, whose dimension is denoted $n_x$, is what the model uses for training and making predictions.
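
As a rough sketch, an RGB image stored as a (height, width, 3) NumPy array can be flattened into a single column vector; the 64×64 size below is just an illustrative assumption:

```python
import numpy as np

# Hypothetical 64x64 RGB image: three colour-channel matrices stacked
# into an array of shape (64, 64, 3).
image = np.random.randint(0, 256, size=(64, 64, 3))

# Flatten into a single feature vector of shape (n_x, 1),
# where n_x = 64 * 64 * 3 = 12288.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1)
```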

Notation

  • $m$: Number of training examples.
  • $n_x$: Size of the input feature vector.
  • $n_y$: Size of the output vector (usually 1 for binary classification, where each label is 0 or 1).
  • $\mathbf{x}^{(1)}$ and $\mathbf{y}^{(1)}$: The first input and output vectors in the dataset.

The input data $\mathbf{X}$ and output data $\mathbf{Y}$ are represented as:

$$\mathbf{X} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(m)}], \quad \mathbf{Y} = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(m)}]$$

This can be visualised as:

$$\mathbf{X} = \begin{bmatrix} \vert & \vert & & \vert \\ \mathbf{x}^{(1)} & \mathbf{x}^{(2)} & \cdots & \mathbf{x}^{(m)} \\ \vert & \vert & & \vert \end{bmatrix}_{n_x \times m}$$

Using NumPy for Efficient Operations

Python's NumPy library is commonly used for handling these data structures efficiently. It allows for quick vectorized operations on large matrices, a common requirement in deep learning.
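
As a rough illustration of why vectorization matters (the array sizes here are arbitrary), the same dot product computed with an explicit Python loop and with a single NumPy call:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Explicit loop: one multiplication and addition per element.
loop_result = 0.0
for i in range(len(a)):
    loop_result += a[i] * b[i]

# Vectorized: the same computation in one optimized call.
vectorized_result = np.dot(a, b)

print(np.isclose(loop_result, vectorized_result))  # True
```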

Each training example is a pair $(\mathbf{x}, \mathbf{y})$, where $\mathbf{x}$ is an $n_x$-dimensional vector (input) and $\mathbf{y}$ is the output. To check the dimensions in Python, you can use X.shape, which for the full input matrix returns $(n_x, m)$. The output matrix $\mathbf{Y}$ is usually a $(1, m)$ row vector, holding the binary output for each training example.
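
For example, under the shape conventions above (the dataset here is randomly generated, purely for illustration):

```python
import numpy as np

# Hypothetical dataset: m = 100 training examples, each a flattened
# image with n_x = 12288 features.
m, n_x = 100, 12288

# Stack the m column vectors side by side so X has shape (n_x, m).
X = np.random.rand(n_x, m)

# Binary labels stored as a (1, m) row vector.
Y = np.random.randint(0, 2, size=(1, m))

print(X.shape)  # (12288, 100)
print(Y.shape)  # (1, 100)
```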

The Basics of Logistic Regression

In linear regression, we might use a simple equation like $y = wx + b$, but this doesn't work well for binary classification. Instead, given a feature vector $\mathbf{x}$, logistic regression predicts a binary outcome $\hat{y}$, starting from the linear equation:

$$\hat{y} = \mathbf{w}^T\mathbf{x} + b$$

Here, $\mathbf{w}$ is an $n_x$-dimensional vector of weights, $b$ is a real number (the bias term), and $\mathbf{x}$ is our input vector. The goal is to learn the best values for $\mathbf{w}$ and $b$ so that $\hat{y}$ is as close as possible to the actual outcome $y$. Essentially, given $\{ (\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \ldots, (\mathbf{x}^{(m)}, \mathbf{y}^{(m)}) \}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.
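
A tiny sketch of this linear step (the numbers below are arbitrary):

```python
import numpy as np

# Illustrative single example with n_x = 3 features.
x = np.array([[0.5], [1.2], [-0.3]])   # (n_x, 1) input vector
w = np.array([[0.1], [0.4], [0.2]])    # (n_x, 1) weights (arbitrary values)
b = 0.5                                # bias

# w^T x + b produces a single real number, not yet a probability.
z = np.dot(w.T, x) + b
print(z)  # [[0.97]]
```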

Sigmoid Function for Probabilities

To ensure the output $\hat{y}$ is between 0 and 1, we apply the sigmoid function:

$$\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b), \quad \sigma(z) = \frac{1}{1+e^{-z}}$$

(Figure: the sigmoid function.)

The sigmoid function squashes the output of the linear equation into a range between 0 and 1, making it possible to interpret the result as a probability.

ℹ️ Note that if $z$ is a large positive number, then:

$$\sigma(z) \approx \frac{1}{1 + 0} = 1$$

and if $z$ is a large negative number, then:

$$\sigma(z) \approx \frac{1}{1 + \text{big number}} \approx 0$$
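
A minimal NumPy sketch of the sigmoid and its behaviour at the extremes:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Large positive z pushes the output towards 1,
# large negative z pushes it towards 0.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.53978687e-05 5.00000000e-01 9.99954602e-01]
```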

Cost Function in Logistic Regression

To train the parameters $\mathbf{w}$ and $b$, it is critical to choose the right cost function. The first loss function you might reach for is the mean squared error:

$$\mathcal{L}(\hat{y}, y) = \frac{1}{2} (\hat{y} - y)^2$$

Unlike in linear regression, we don't use the mean squared error here because, combined with the sigmoid, it produces a non-convex cost with multiple local minima. Instead, we use the binary cross-entropy loss function:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big)$$

This loss function is convex, so it has just one minimum, making it easier to optimize. To explain it, consider the two cases (a short numerical check follows this list):

  • For an actual label $y$ of 1:

    • The loss simplifies to $\mathcal{L}(\hat{y}, 1) = -\log(\hat{y})$
    • To minimise the loss, we want the predicted probability $\hat{y}$ to be as large as possible
    • The largest value $\hat{y}$ can take is 1
  • For an actual label $y$ of 0:

    • The loss simplifies to $\mathcal{L}(\hat{y}, 0) = -\log(1 - \hat{y})$
    • We want $1 - \hat{y}$ to be as large as possible
    • So we want $\hat{y}$ to be as small as possible, approaching 0
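
As a quick numerical check of these two cases (the values of $\hat{y}$ below are just illustrative):

```python
import numpy as np

def loss(y_hat, y):
    """Binary cross-entropy loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# When y = 1, the loss shrinks as y_hat approaches 1.
print(loss(0.9, 1))  # ~0.105
print(loss(0.1, 1))  # ~2.303

# When y = 0, the loss shrinks as y_hat approaches 0.
print(loss(0.1, 0))  # ~0.105
print(loss(0.9, 0))  # ~2.303
```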

To measure the performance across the entire training set, we use the cost function:

$$\begin{aligned} J(w, b) &= \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right) \end{aligned}$$

The cost function $J$ is the average of the loss $\mathcal{L}$ over all $m$ training examples, where $\mathcal{L}$ is the binary cross-entropy loss.
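
A vectorized sketch of this cost computation, following the shape conventions above (the data is randomly generated purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(w, b, X, Y):
    """Average binary cross-entropy over all m examples.

    w: (n_x, 1) weights, b: scalar bias,
    X: (n_x, m) inputs, Y: (1, m) binary labels.
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)  # (1, m) predictions
    losses = Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)
    return -np.sum(losses) / m

# Illustrative usage with random data.
n_x, m = 4, 10
X = np.random.rand(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))
w = np.zeros((n_x, 1))
b = 0.0
print(compute_cost(w, b, X, Y))  # ~0.693 with zero-initialised parameters
```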

Gradient Descent

Gradient descent is a method used in machine learning to find the best parameters for our model, in this case $\mathbf{w}$ and $b$, by minimizing the cost function. We start by setting $\mathbf{w}$ and $b$ to zero; because the cost function is convex, zero initialisation works fine in logistic regression and gives us a neutral starting decision boundary.

With the learning rate $\alpha$, which controls how big a step we take when updating the parameters, we iteratively adjust $\mathbf{w}$ and $b$ using the following update rules:

For $w$:

$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$

And for $b$:

$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$

We normally denote this more simply as:

$$w := w - \alpha\,(\partial w), \quad b := b - \alpha\,(\partial b)$$

Here, the derivatives $\frac{\partial J(w, b)}{\partial w}$ and $\frac{\partial J(w, b)}{\partial b}$ give us the direction in which to change $w$ and $b$ to reduce the cost, meaning they tell us how to adjust our parameters to make our model's predictions more accurate.
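
Putting the pieces together, here is a minimal training-loop sketch. It assumes the standard gradients of the binary cross-entropy cost for logistic regression, $\partial w = \frac{1}{m}\mathbf{X}(\hat{Y} - Y)^T$ and $\partial b = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$, which are not derived here, and the hyperparameter values are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, alpha=0.01, num_iterations=1000):
    """Gradient descent for logistic regression.

    X: (n_x, m) inputs, Y: (1, m) binary labels.
    Returns the learned weight vector w and bias b.
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))   # start from a neutral decision boundary
    b = 0.0

    for _ in range(num_iterations):
        Y_hat = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)

        # Gradients of the cost J with respect to w and b
        # (standard results for binary cross-entropy with a sigmoid).
        dw = np.dot(X, (Y_hat - Y).T) / m     # shape (n_x, 1)
        db = np.sum(Y_hat - Y) / m            # scalar

        # Update rules: step against the gradient, scaled by alpha.
        w = w - alpha * dw
        b = b - alpha * db

    return w, b
```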