
Normalization and Regularization

Normalization of inputs is a preprocessing step. Normalization refers to the process of scaling input data so that it fits within a specific range, like [0, 1] or [-1, 1], or so that it has a mean of 0 and a standard deviation of 1.

Regularization, on the other hand, is a technique used to prevent overfitting by adding a penalty, which comes in different forms, to the loss function of complex models. Regularization helps to reduce the model's complexity by penalizing large weights, leading to simpler models that perform better on unseen data.

While normalization and regularization are different concepts, they are both methods used to improve the training process and final model performance.

Normalizing Inputs

Unnormalised vs Normalised Inputs

As with the reasoning for normalizing inputs in classical ML applications, if deep learning inputs aren't normalized, the model's view of the data is skewed, making learning slow and cumbersome. Normalized inputs make the learning process quicker and the path to the best solution straighter.

  1. Get the training set mean:
     \mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}
  2. Subtract the mean from each input to center your inputs around 0:
     X := X - \mu
  3. Get the variance of the (now centered) training set:
     \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})^2
  4. Divide your inputs by the standard deviation (the square root of this variance) to scale them:
     X := \frac{X}{\sigma}
  5. Use the same mean (μ) and variance (σ²) to scale your test set.
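A minimal numpy sketch of these steps (X_train and X_test are assumed to be arrays of shape (n_x, m), features by examples):

import numpy as np

# X_train, X_test: assumed arrays of shape (n_x, m) - features by examples
mu = np.mean(X_train, axis=1, keepdims=True)            # per-feature training set mean
X_train = X_train - mu                                   # center inputs around 0
sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)    # per-feature variance of the centered data
X_train = X_train / np.sqrt(sigma2 + 1e-8)               # scale to unit standard deviation

# use the SAME mu and sigma2 to normalize the test set
X_test = (X_test - mu) / np.sqrt(sigma2 + 1e-8)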

These steps should be applied to the training, dev, and test sets (but always using the mean and variance of the training set). If we don't normalize the inputs, our cost function will be elongated and inconsistent in shape, and optimizing it will take a long time: gradient descent will need more iterations because we will have to use a smaller learning rate.

If we normalize the inputs, the opposite occurs. The shape of the cost function will be more consistent (more symmetric, like the second plot) and we can use a larger learning rate α, so the optimization will be faster. Wherever you start, you will move more directly toward the optimum, with larger steps.

Regularization

Regularization is a technique used in neural networks to reduce variance (overfitting), which happens when a model performs well on training data but poorly on new, unseen data. By incorporating regularization, a neural network can reduce its variance, leading to more generalized performance. There are many different types of regularization methods.

L1 & L2 Regularization

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights toward smaller values: it becomes too costly for the model to have large weights. This leads to a smoother model in which the output changes more slowly as the input changes.

  1. L1 Regularization (L1 Norm): This method encourages sparsity in the model, meaning it leads to a lot of weights being exactly zero. It is calculated as the sum of the absolute values of the weights. Mathematically, for logistic regression, it's represented as:

     \lVert w \rVert_1 = \sum_{j=1}^{n_x} |w_j|

  2. L2 Regularization (L2 Norm; called the Frobenius Norm for matrices): This technique is often used to prevent the weights from becoming too large, which can lead to overfitting. It's calculated as the sum of the squares of all the weights:

     \lVert w \rVert_2^2 = \sum_{j=1}^{n_x} w_j^2

     When w is a vector, it can also be calculated as the squared Euclidean norm:

     \lVert w \rVert_2^2 = w^T w

     In this expression, w^T represents the transpose of the weight vector w.
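As a quick worked example, both norms are one line of numpy each (the vector w below is made up for illustration):

import numpy as np

w = np.array([0.5, -1.2, 0.0, 2.0])   # example weight vector

l1 = np.sum(np.abs(w))                # ||w||_1  = 0.5 + 1.2 + 0.0 + 2.0 = 3.7
l2_sq = np.sum(w ** 2)                # ||w||_2^2 = 0.25 + 1.44 + 0.0 + 4.0 = 5.69
l2_sq_alt = w.T @ w                   # same value computed as w^T w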

Graphically you can see that L1 ends up being that sort of diamond shape, whereas L2 ends up being a circle.

L2 and L1

L1 & L2 Regularization in Logistic Regression

In logistic regression, regularization is applied to the cost function to control overfitting. The standard cost function is:

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})

For L2 regularization, the cost function is modified as:

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert w \rVert_2^2

And for L1 regularization, it becomes:

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert w \rVert_1

In L1 regularization, many weights become zero, reducing the model size. L2 regularization is more commonly used. Here, λ is the regularization parameter, a hyperparameter that needs to be tuned during model training; set it using your dev (hold-out) set.
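A minimal sketch of the L2-regularized logistic regression cost in numpy (the argument names are assumptions; lambd is used because lambda is a reserved word in Python):

import numpy as np

def l2_regularized_cost(y_hat, y, w, lambd):
    m = y.shape[0]
    # cross-entropy loss averaged over the m examples
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # L2 penalty: (lambda / 2m) * ||w||_2^2
    l2_penalty = (lambd / (2 * m)) * np.sum(w ** 2)
    # for L1, the penalty would instead be (lambd / (2 * m)) * np.sum(np.abs(w))
    return cross_entropy + l2_penalty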

L1 & L2 Regularization in Neural Networks

For an entire neural network, the cost function is:

J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert w^{[l]} \rVert_F^2

Where this Frobenius norm of a matrix is defined as:

\lVert w^{[l]} \rVert_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{i,j}^{[l]})^2

Note that w^{[l]} is a matrix of shape (n^{[l]}, n^{[l-1]}): the number of rows i equals the number of neurons in the current layer n^{[l]}, whereas the number of columns j equals the number of neurons in the previous layer n^{[l-1]}.
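A sketch of how the regularization term might be added to the cost for a network whose weights are stored as parameters['W1'], ..., parameters['WL'] (an assumed dictionary layout):

import numpy as np

def l2_regularization_term(parameters, lambd, m, L):
    # sum of squared Frobenius norms of W[1] ... W[L], scaled by lambda / 2m
    frob = sum(np.sum(np.square(parameters['W' + str(l)])) for l in range(1, L + 1))
    return (lambd / (2 * m)) * frob

# cost = cross_entropy_cost + l2_regularization_term(parameters, lambd, m, L)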

To do backpropagation, before we added the extra regularization term, we had:

dw^{[l]} = (from backpropagation)

The new way, with regularization:

dw^{[l]} = (from backpropagation) + \frac{\lambda}{m} w^{[l]}

In practice this penalizes large weights and effectively limits the freedom in your model, which is why this is sometimes referred to as "weight decay". We can see this if we plug it into the weight update equation:

\begin{aligned} w^{[l]} &= w^{[l]} - \alpha \, dw^{[l]} \\ &= w^{[l]} - \alpha \left( \frac{\partial J}{\partial w^{[l]}} + \frac{\lambda}{m} w^{[l]} \right) \\ &= w^{[l]} - \alpha \frac{\lambda}{m} w^{[l]} - \alpha \frac{\partial J}{\partial w^{[l]}} \\ &= w^{[l]} \left( 1 - \frac{\alpha \lambda}{m} \right) - \alpha \frac{\partial J}{\partial w^{[l]}} \end{aligned}

We are multiplying w^{[l]} by the term (1 - αλ/m). This term is always less than 1, so it makes w^{[l]} smaller. This is why it's called "weight decay": it causes each weight to decay in proportion to its size.

This penalizes large weights and encourages a more even distribution of weight values, reducing the model's reliance on any single weight. The result is that we reduce the impact of a lot of the hidden units and effectively end up with a simpler network that is less prone to overfitting.
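A sketch of the resulting update for one layer (all names are assumptions, with dW_backprop standing for the gradient that comes out of ordinary backpropagation):

def update_with_weight_decay(W, dW_backprop, alpha, lambd, m):
    dW = dW_backprop + (lambd / m) * W        # gradient including the regularization term
    # equivalent to W * (1 - alpha * lambd / m) - alpha * dW_backprop
    return W - alpha * dW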

Why does L1 & L2 regularization reduce overfitting?

Take the cost function:

J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert w^{[l]} \rVert_F^2

Here are some intuitions for how regularization reduces overfitting:

  • Intuition 1:

    • If λ is too large, a lot of the w's will be close to zero, which makes the NN simpler (you can think of it as behaving closer to logistic regression).

    • If λ is chosen well, it will just reduce the weights that were making the neural network overfit.

    • L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting in a model with high bias.

  • Intuition 2 (with the tanh activation function):

    • If λ is too large, the w's will be small (close to zero), so the pre-activations z = w·a + b will also be small and we will end up using only the roughly linear part of the tanh activation function. We go from a non-linear activation to a roughly linear one, which makes the NN a roughly linear classifier.

    • If λ is chosen well, it will just make some of the tanh activations roughly linear, which prevents overfitting: the network can no longer fit the very complicated, non-linear decision boundaries that allow it to overfit the data set.

Tanh

Implementation tip: if you implement gradient descent, one way to debug it is to plot the cost function J as a function of the number of iterations. You want to see that J decreases monotonically, i.e. it only ever decreases, after every iteration of gradient descent with regularization. If you plot the old definition of J (without the regularization term) you might not see it decrease monotonically.

Cost Function Decreasing
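A minimal debugging sketch, assuming you record the (regularized) cost in a list called costs during training:

import matplotlib.pyplot as plt

# costs: assumed list of J values recorded during training (e.g. every 100 iterations)
plt.plot(costs)
plt.xlabel('iterations')
plt.ylabel('cost J (including the regularization term)')
plt.show()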

Implications of L2-regularization

  • Cost Computation
    • A regularization term is added to the cost
  • Backpropagation function
    • There are extra terms in the gradients with respect to weight matrices
  • Weights:
    • Weights are pushed to smaller values ("weight decay").

Comparison of L1 & L2 Regularization

  • L1: penalizes the sum of the absolute values of the weights
    • Performs feature selection - entire features can go to 0
      • Mathematically, it can drive the weights of entire features exactly to zero, so the regularization term effectively selects the features that are more important than others.
    • Computationally inefficient
      • L1 involves taking the absolute values of the weights, so the penalty is a non-differentiable piecewise function and there is no closed-form solution. L1 regularization is computationally more expensive because it cannot be solved purely in terms of matrix math.
    • Sparse output - removes information
  • L2: penalizes the sum of the squares of the weights
    • All features remain considered, just weighted
    • Computationally efficient
    • Denser output - nothing is discarded

Why would you want L1? Because of the curse of dimensionality: feature selection reduces dimensionality, and L1 regularization is one way of doing it automatically. In an extreme example, out of 100 different features, maybe only 10 end up with non-zero coefficients under L1 regularization, and the resulting sparsity can make up for the computational inefficiency of L1 regularization itself.

But if you think all of your features are important, L2 is probably a better choice, because it does not do feature selection: it won't wipe out entire features by driving their weights all the way down to zero.
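As a rough illustration of this difference, here is a small synthetic comparison using scikit-learn's Lasso (L1) and Ridge (L2); the data and alpha value are made up purely to show the sparsity effect:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 100)              # 200 examples, 100 features
y = X[:, :10] @ rng.randn(10)        # only the first 10 features actually matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print(np.sum(lasso.coef_ == 0))      # many coefficients driven exactly to zero (sparse)
print(np.sum(ridge.coef_ == 0))      # usually 0 - every feature keeps some small weight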

Dropout Regularization

Dropout is another regularization technique. It is less commonly used than L2 regularization, but is worth knowing. It eliminates some neurons/weights on each iteration based on a keep probability keep_prob (p). A common technique to implement dropout is called "inverted dropout".

You only use dropout during training, because we don't want to randomly eliminate nodes during test time. We apply dropout both during forward and backward propagation.

Dropout Regularization

Code for inverted dropout (applied to layer 3) may look like:

import numpy as np

keep_prob = 0.8   # probability that a unit is kept: 80% stay, 20% are dropped
# a3: activations of layer 3 from forward propagation, shape (n[3], m)

# create a dropout mask of 0s and 1s: entries where the random number
# is less than keep_prob are kept
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

a3 = np.multiply(a3, d3)   # zero out the dropped activations in a3
a3 = a3 / keep_prob        # scale a3 up so its expected value is unchanged
                           # ("inverted" dropout - solves the scaling problem)

The dropout vector d[l] is used for forward and backward propagation and is the same for both of them, but it is different for each iteration (pass) or training example. At test time we don't use dropout. If you implement dropout at test time, it would just add noise to predictions.

During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value.
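During backpropagation the same mask and scaling are applied to the gradients; a sketch for layer 3 (dA3 is the assumed gradient of the cost with respect to a3, and d3 is the mask saved from the forward pass):

dA3 = dA3 * d3            # shut down the gradients of the same neurons dropped in the forward pass
dA3 = dA3 / keep_prob     # scale the gradients the same way the activations were scaled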

With dropout, your neurons become less sensitive to the activation of any one other specific neuron, because that other neuron might be shut down at any time: its contribution can go away at random. The network is therefore forced to learn more robust, meaningful relationships.

Understanding Dropout

The intuition is that dropout randomly knocks out units in your network. So it's as if on every iteration you're working with a smaller NN, and so using a smaller NN seems like it should have a regularizing effect. Another intuition is that you can't rely on any one feature, so you have to spread out weights.

  • It's possible to show that dropout has a similar effect to L2 regularization.

  • The input layer's keep_prob p has to be near 1, because you don't want to eliminate a lot of input features.

  • Dropout can have a different p per layer. If you're more worried about some layers overfitting than others, you can set a lower p for those layers. The downside is that this gives you even more hyperparameters to search over.

  • One alternative is to have some layers where you apply dropout and some layers where you don't, and then have just one hyperparameter: the p for the layers where you do apply dropout.

  • A lot of researchers are using dropout with Computer Vision (CV) because they have a very big input size and almost never have enough data, so overfitting is the usual problem. And dropout is a regularization technique to prevent overfitting.

  • A downside of dropout is that the cost function J is no longer well defined, so it is hard to debug by plotting J per iteration.

    • To solve that, turn off dropout (set all the p's to 1), run the code to check that J decreases monotonically, and then turn dropout back on.

To reduce overfitting, you might use a relatively low keep_prob, say 0.5, for large layers, whereas for layers where you worry less about overfitting (smaller layers) you can use a higher keep_prob. You can use a keep_prob of 1 for layers where you don't worry about overfitting at all (e.g. the input or output layer).

Data augmentation

For example, with computer vision data:

  • You can flip all your pictures horizontally; this will give you m more data instances.
  • You could also apply random shifts and rotations to images to get more data.

For example in OCR, you can impose random rotations and distortions to digits/letters.

Data augmentation

New data obtained using this technique isn't as good as the real independent data, but still can be used as a regularization technique.
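A minimal numpy sketch of the horizontal-flip augmentation (images and labels are assumed arrays; images has shape (m, height, width, channels)):

import numpy as np

# images: assumed array of shape (m, height, width, channels); labels: shape (m,)
flipped = images[:, :, ::-1, :]                          # flip each image left-right
images_aug = np.concatenate([images, flipped], axis=0)   # m extra training instances
labels_aug = np.concatenate([labels, labels], axis=0)    # flipped copies keep the same labels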

Early stopping

In this technique we plot the training set cost and the dev set cost together for each iteration. At some iteration the dev set cost will stop decreasing and start increasing. We pick the point at which the dev set cost is lowest (just before it starts to increase, while the training cost is still low) and take the parameters from that iteration as the best parameters.

Early stopping

Andrew prefers to use L2 regularization instead of early stopping, because early stopping simultaneously tries to minimize the cost function and avoid overfitting, which contradicts the orthogonalization approach (discussed later): we should prefer a separate tool for each task, one task at a time. Coupling the two tasks makes each of them harder to think about.

An advantage of early stopping is that you don't need to search for and tune a hyperparameter like in other regularization approaches (like λ\lambda in L2 regularization).
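A sketch of the bookkeeping behind early stopping (train_step and compute_dev_cost are hypothetical helpers standing in for your training loop and dev set evaluation):

best_dev_cost = float('inf')
best_parameters = None

for iteration in range(num_iterations):
    parameters = train_step(parameters)        # one pass of gradient descent (hypothetical helper)
    dev_cost = compute_dev_cost(parameters)    # cost on the dev set (hypothetical helper)
    if dev_cost < best_dev_cost:               # remember the parameters with the lowest dev cost
        best_dev_cost = dev_cost
        best_parameters = parameters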

Model Ensembles

Similar to when we looked at ensemble learning, we can use model ensembles to reduce overfitting. Ensemble methods involve training multiple models separately and then combining their predictions. This approach can gain you roughly an extra 2% in performance, as it reduces the generalization error.
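A minimal averaging-ensemble sketch (models is an assumed list of separately trained models that each expose a predict method):

import numpy as np

def ensemble_predict(models, X):
    # average the predictions of each separately trained model
    predictions = [model.predict(X) for model in models]
    return np.mean(predictions, axis=0)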