Gradient Descent
Computation graphs are visual representations of mathematical operations that flow from inputs to outputs. They're particularly useful in deep learning for organizing calculations. For example, consider the function:
$$J(a, b, c) = 3(a + bc)$$
In this function, we have the intermediate variables:
$$u = bc, \qquad v = a + u$$
And the final output:
$$J = 3v$$
We can represent this as a graph that flows from left to right: the inputs $a$, $b$, and $c$ feed into $u = bc$, then into $v = a + u$, and finally into the output $J = 3v$.
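As a minimal sketch, the forward pass through this graph can be written in Python; the input values $a = 5$, $b = 3$, $c = 2$ are chosen here purely for illustration:

```python
# Forward pass through the computation graph J(a, b, c) = 3(a + bc).
# The input values a = 5, b = 3, c = 2 are illustrative placeholders.

def forward(a, b, c):
    u = b * c      # first intermediate: u = bc
    v = a + u      # second intermediate: v = a + u
    J = 3 * v      # final output: J = 3v
    return u, v, J

u, v, J = forward(5, 3, 2)
print(u, v, J)  # 6 11 33
```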
Derivatives with a Computation Graph
In a computation graph, we often want to calculate how changes in the inputs affect the output. This is where the chain rule is key. It states that for two functions $f$ and $g$, the derivative of their composition with respect to $x$ is:
$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$
In terms of variables, if $y = f(u)$ and $u = g(x)$, then the derivative of $y$ with respect to $x$ is given by:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Here, $\frac{dy}{du}$ is evaluated at the point $u = g(x)$.
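As a quick sanity check, here is a short Python sketch comparing the chain-rule derivative against a finite-difference estimate; the functions $f(u) = u^2$ and $g(x) = 3x + 1$ are arbitrary illustrative choices, not part of the notes above:

```python
# Numerical check of the chain rule dy/dx = dy/du * du/dx
# using the arbitrary example functions f(u) = u**2 and g(x) = 3x + 1.

def f(u):
    return u ** 2

def g(x):
    return 3 * x + 1

x = 2.0
eps = 1e-6

# Finite-difference estimate of d f(g(x)) / dx.
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

# Chain rule: f'(u) = 2u evaluated at u = g(x), times g'(x) = 3.
analytic = 2 * g(x) * 3

print(numeric, analytic)  # both approximately 42.0
```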
In the context of a computation graph, we calculate derivatives from the output back to the inputs. This right-to-left pass, which is the more natural direction for computing derivatives, is known as backpropagation in machine learning. The notation $d\text{var}$ represents the derivative of the final output $J$ with respect to the various intermediate quantities, for example $dv = \frac{\partial J}{\partial v}$.
Backpropagation Example
Take the computation graph with the forward pass described above. For the backward pass, we want to compute the derivatives of the final output $J$ with respect to the inputs $a$, $b$, and $c$, and along the way we will need them in terms of the intermediate quantities $v$ and $u$.
- Compute the intermediates $dv = \frac{\partial J}{\partial v} = 3$ and $du = \frac{\partial J}{\partial u} = dv \cdot \frac{\partial v}{\partial u} = 3$, which are the derivatives of the final output with respect to $v$ and $u$.
- Compute $da = \frac{\partial J}{\partial a} = dv \cdot \frac{\partial v}{\partial a} = 3$, which is the derivative of the final output with respect to $a$.
- Compute $db = \frac{\partial J}{\partial b} = du \cdot \frac{\partial u}{\partial b} = 3c$, which is the derivative of the final output with respect to $b$.
- Compute $dc = \frac{\partial J}{\partial c} = du \cdot \frac{\partial u}{\partial c} = 3b$, which is the derivative of the final output with respect to $c$.
We compute these derivatives going from right to left, applying the chain rule at each step. These gradients tell us how much a change in a given variable affects the final output. This is crucial for algorithms like gradient descent, where we need to adjust parameters to minimize a cost function.
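Written out as code, the backward pass for this example takes only a few lines; this is a minimal Python sketch, reusing the illustrative inputs $a = 5$, $b = 3$, $c = 2$ from the forward-pass sketch above:

```python
# Backward pass for J(a, b, c) = 3(a + bc), computed right to left.
# The inputs a = 5, b = 3, c = 2 are the same illustrative values as above.
a, b, c = 5.0, 3.0, 2.0

# Forward pass (intermediates are stored so the backward pass can reuse them).
u = b * c
v = a + u
J = 3 * v

# Backward pass: apply the chain rule from the output back to the inputs.
dv = 3.0          # dJ/dv, since J = 3v
du = dv * 1.0     # dJ/du = dJ/dv * dv/du, and dv/du = 1
da = dv * 1.0     # dJ/da = dJ/dv * dv/da, and dv/da = 1
db = du * c       # dJ/db = dJ/du * du/db, and du/db = c
dc = du * b       # dJ/dc = dJ/du * du/dc, and du/dc = b

print(da, db, dc)  # 3.0 6.0 9.0
```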
Logistic Regression Gradient Descent
Similarly, take the derivatives of the gradient descent example for one sample with two features:
- $z = w_1 x_1 + w_2 x_2 + b$ is the linear combination of the inputs and weights plus the bias.
- $a = \sigma(z)$ is the activation, computed using the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$.
- $\mathcal{L}(a, y) = -\left(y \log a + (1 - y) \log(1 - a)\right)$ is the loss function, which is the binary cross-entropy.
- The derivatives $dw_1$, $dw_2$, and $db$ describe how the loss function changes with respect to each parameter.
- The update rules define how to adjust the parameters $w_1$, $w_2$, and $b$ in the direction that minimizes the loss, using a learning rate $\alpha$.
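To make this setup concrete, here is a minimal Python sketch of the forward pass for a single sample; the feature values, parameters, and label below are placeholders chosen only for illustration:

```python
import math

# Forward pass for logistic regression on one sample with two features.
# All numeric values below are illustrative placeholders.
x1, x2 = 1.0, 2.0            # input features
w1, w2, b = 0.5, -0.3, 0.1   # weights and bias
y = 1.0                      # true label

z = w1 * x1 + w2 * x2 + b                               # linear combination plus bias
a = 1.0 / (1.0 + math.exp(-z))                          # sigmoid activation
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # binary cross-entropy

print(z, a, loss)
```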
Now for the derivatives of the backward pass. First, the derivative of the binary cross-entropy loss function with respect to the activation $a$:
$$da = \frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
For the derivative of the sigmoid function:
$$\frac{\partial a}{\partial z} = a(1 - a)$$
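This identity can be verified by differentiating the sigmoid directly:
$$\frac{d\sigma}{dz} = \frac{d}{dz}\left(\frac{1}{1 + e^{-z}}\right) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\bigl(1 - \sigma(z)\bigr) = a(1 - a)$$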
For $dz = \frac{\partial \mathcal{L}}{\partial z}$ we have, by the chain rule:
$$dz = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a) = a - y$$
Now for the partial derivatives with respect to the weights and the bias:
$$dw_1 = \frac{\partial \mathcal{L}}{\partial w_1} = x_1 \, dz, \qquad dw_2 = \frac{\partial \mathcal{L}}{\partial w_2} = x_2 \, dz, \qquad db = \frac{\partial \mathcal{L}}{\partial b} = dz$$
And now the update rules for the weights and the bias:
$$w_1 := w_1 - \alpha \, dw_1, \qquad w_2 := w_2 - \alpha \, dw_2, \qquad b := b - \alpha \, db$$
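Continuing the single-sample sketch from above, one full backward pass and parameter update might look like this in Python (the learning rate and all numeric values remain illustrative placeholders):

```python
import math

# One step of gradient descent for logistic regression on a single sample.
# Features, initial parameters, label, and learning rate are placeholders.
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.5, -0.3, 0.1
alpha = 0.1  # learning rate

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = 1.0 / (1.0 + math.exp(-z))

# Backward pass, using dz = a - y.
dz = a - y
dw1 = x1 * dz
dw2 = x2 * dz
db = dz

# Parameter update in the direction that decreases the loss.
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db

print(w1, w2, b)
```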
Gradient Descent on m Examples
Take the cost function for logistic regression, which is the average of the binary cross-entropy loss over all $m$ training examples:
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(a^{(i)}, y^{(i)}\right)$$
where $a^{(i)} = \sigma\left(z^{(i)}\right)$ is the prediction for the $i$-th example.
We can also calculate the average gradient across all examples. The derivative of $J$ with respect to $w_1$ is:
$$\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1} \mathcal{L}\left(a^{(i)}, y^{(i)}\right) = \frac{1}{m} \sum_{i=1}^{m} x_1^{(i)} \left(a^{(i)} - y^{(i)}\right)$$
and similarly for $w_2$ and $b$.
Once we've computed the derivatives of the cost function with respect to each parameter, we update the parameters $w_1$, $w_2$, and $b$ to minimize the cost function $J(w, b)$.
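Putting everything together, here is a minimal Python sketch of one gradient-descent step over $m$ examples, accumulating and then averaging the per-example gradients; the small dataset and learning rate are illustrative placeholders:

```python
import math

# One gradient-descent step for logistic regression over m examples,
# accumulating the per-example gradients and averaging them.
# The dataset and hyperparameters below are illustrative placeholders.
X = [(1.0, 2.0), (0.5, -1.0), (2.0, 0.0)]  # m = 3 samples, two features each
Y = [1.0, 0.0, 1.0]                        # labels
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.1
m = len(X)

J = dw1 = dw2 = db = 0.0
for (x1, x2), y in zip(X, Y):
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))
    J += -(y * math.log(a) + (1 - y) * math.log(1 - a))
    dz = a - y
    dw1 += x1 * dz
    dw2 += x2 * dz
    db += dz

# Average the cost and the gradients over all m examples.
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m

# Update the parameters in the direction that decreases the cost.
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db

print(J, w1, w2, b)
```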