
Softmax Classification

While binary classification deals with two possible outcomes, Softmax regression is the go-to method for classifying instances into multiple categories, such as identifying animal types in images.

Working through Softmax

Consider a scenario where we have the classes Dog (1), Cat (2), Baby Chick (3), and None (0). Here, C is the number of classes (4 in this example), the class labels range over [0, C-1], and the number of nodes in the output layer is n^{[L]} = C. We first convert the class labels into a vector representation using one-hot encoding.

One Hot Encoding

We use one-hot encoding to represent our classes. This means that for each class we have a vector where one element is 1, indicating the class, and the rest are 0s; this is the usual way to represent class labels as vectors. Given our label vector y with values in the range [0, C-1], where C is the number of classes, we perform one-hot encoding to obtain a matrix Y with dimensions (C, m). Say we have:

\mathbf{y} = \begin{bmatrix} 1 & 2 & 3 & 0 & 2 & 1 \end{bmatrix}

We convert this into:

\begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix} \begin{array}{l} \text{class = 0} \\ \text{class = 1} \\ \text{class = 2} \\ \text{class = 3} \end{array}

This is called "one hot" encoding because in the converted representation, exactly one element of each column is hot (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In TensorFlow, you can just use tf.one_hot(labels, depth, axis=0). axis=0 indicates the new axis is created at dimension 0.
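
As a minimal NumPy sketch (the helper name one_hot is just for illustration), indexing into an identity matrix gives the same (C, m) result as tf.one_hot with axis=0:

```python
import numpy as np

def one_hot(y, C):
    """Convert integer labels of shape (m,) into a (C, m) one-hot matrix."""
    # Column j of the result is the one-hot vector for label y[j].
    return np.eye(C)[:, y]

y = np.array([1, 2, 3, 0, 2, 1])
print(one_hot(y, 4))
# Equivalent in TensorFlow: tf.one_hot(y, depth=4, axis=0)
```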

Softmax Calculation

Given an input vector from the last layer \mathbf{z}^{[L]}, the Softmax function calculates the probabilities as follows:

\sigma(\mathbf{z}^{[L]})_j = \frac{e^{\mathbf{z}^{[L]}_j}}{\sum_{i=1}^{C}{t_i}}

where t_i is:

t_i = e^{\mathbf{z}^{[L]}_i}

The denominator sums the entries of t, the exponential applied component-wise to the input vector, ensuring that the softmax output is a probability distribution that sums to 1. For example, given

\mathbf{z}^{[L]} = \begin{bmatrix} 5 \\ 2 \\ -1 \\ 3 \end{bmatrix}

We compute t as follows:

\mathbf{t} = \begin{bmatrix} e^5 \\ e^2 \\ e^{-1} \\ e^3 \end{bmatrix}

Then we apply the softmax function as follows:

\sigma(\mathbf{z}^{[L]}) = \mathbf{a}^{[L]} = \begin{bmatrix} \frac{e^5}{e^5 + e^2 + e^{-1} + e^3} \\ \frac{e^2}{e^5 + e^2 + e^{-1} + e^3} \\ \frac{e^{-1}}{e^5 + e^2 + e^{-1} + e^3} \\ \frac{e^3}{e^5 + e^2 + e^{-1} + e^3} \end{bmatrix} = \begin{bmatrix} 0.842 \\ 0.042 \\ 0.002 \\ 0.114 \end{bmatrix}
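
A small NumPy sketch of this calculation (the softmax helper below is illustrative; subtracting the column maximum is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Column-wise softmax of z with shape (C, m)."""
    # Subtracting the max of each column avoids overflow in exp();
    # the shift cancels in the ratio, so the probabilities are unchanged.
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # z^[L] for a single example
print(softmax(z).round(3))                    # [[0.842] [0.042] [0.002] [0.114]]
```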

The output of Softmax gives us a vector where each element is the probability of the input belonging to one of the classes, i.e. P(\text{class} \mid \mathbf{x}^{(i)}). The result shown gives a 0.842 chance of being in class 0, for example. That is the highest probability, i.e. the "soft max".

Each of the C values in the output layer contains the probability of the example belonging to the corresponding class.

Training the Classifier

The Softmax classifier uses a cross-entropy loss function, which aims to maximize the probability of the correct class. If the classifier is confident about the correct class, the loss is low. However, if it's unsure or wrong, the loss goes up.

In contrast to softmax, there is an activation called hard max, which puts a 1 in the position of the maximum value and 0s everywhere else. In NumPy, you can build it by comparing each column against np.max over the vertical axis, as in the sketch below. Minimizing the softmax cross-entropy loss is a form of maximum likelihood estimation.
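
A quick illustrative sketch (hard_max here is a throwaway helper, not a library function):

```python
import numpy as np

def hard_max(z):
    """1 at the position of each column's largest entry, 0 elsewhere."""
    return (z == np.max(z, axis=0, keepdims=True)).astype(float)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])
print(hard_max(z))   # [[1.] [0.] [0.] [0.]]
```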

The name softmax comes from softening the values rather than hard-maxing them, i.e. a gentler way of taking a maximum. Softmax is a generalization of the logistic activation function to C classes; if C = 2, softmax reduces to logistic regression. The loss function used with softmax is:

L(y, \hat{y}) = - \sum_{j=1}^{C}{y_j \log{\hat{y}_j}}

Here is an example. Say that we have a cat:

y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}

And our softmax classifier outputs:

\mathbf{a}^{[L]} = \mathbf{\hat{y}} = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.1 \\ 0.4 \end{bmatrix}

We can compute the loss as follows:

L(y, \hat{y}) = - \sum_{j=1}^{4}{y_j \log{\hat{y}_j}} = - \log{0.1} \approx 2.3

Because y_j = 0 for every incorrect class, those terms of the sum vanish and we are left with -\log{\hat{y}_2} (with the 0-indexed labels above, class 2 is the cat). This means the loss function tries to make the predicted probability of the correct class as high as possible (here \hat{y}_2). The cost function used with softmax is:

J(\mathbf{w}^{[1]}, \mathbf{b}^{[1]}, ...) = \frac{1}{m} \sum_{i=1}^{m}{L(y^{(i)}, \hat{y}^{(i)})}
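
A brief sketch of the loss and cost for the example above (the helper names and the tiny epsilon guard against log(0) are assumptions for illustration):

```python
import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Loss for one example; y and y_hat are (C,) vectors."""
    return -np.sum(y * np.log(y_hat + eps))   # eps avoids log(0)

def compute_cost(Y, Y_hat, eps=1e-12):
    """Average loss over m examples; Y and Y_hat have shape (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat + eps)) / m

y     = np.array([0.0, 0.0, 1.0, 0.0])        # the cat example
y_hat = np.array([0.3, 0.2, 0.1, 0.4])
print(cross_entropy_loss(y, y_hat))           # ~2.303, i.e. -log(0.1)
```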

Also, in terms of back propagation with softmax:

d\mathbf{z}^{[L]} = \frac{\partial L}{\partial \mathbf{z}^{[L]}} = \mathbf{a}^{[L]} - \mathbf{y} = \hat{\mathbf{y}} - \mathbf{y}
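
Putting the pieces together, here is a rough sketch of the output-layer forward and backward step under the shapes used above (the mini-batch values are made up for illustration; dZ is what would feed into the rest of backpropagation):

```python
import numpy as np

def softmax(z):
    """Column-wise softmax of z with shape (C, m)."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

# One mini-batch with m = 2 examples and C = 4 classes.
Z = np.array([[ 5.0, 1.0],
              [ 2.0, 4.0],
              [-1.0, 0.5],
              [ 3.0, 2.0]])       # z^[L], shape (C, m)
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])        # one-hot labels, shape (C, m)

A = softmax(Z)    # a^[L] = y_hat
dZ = A - Y        # gradient of the loss w.r.t. z^[L] for each example
print(dZ.shape)   # (4, 2)
```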