
Gradient Descent

Computation graphs are visual representations of mathematical operations that go from inputs to outputs. They're particularly useful in deep learning to organize calculations. For example, consider the function:

J(a, b, c) = 3(a + b \cdot c)

In this function, we have intermediate variables:

u = b \cdot c, \quad v = a + u

And the final output:

J = 3 \cdot v

We can represent this as:

Computation Graph
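As a minimal sketch of the forward pass through this graph (plain Python; the function name and the concrete input values are illustrative, not from the text):

```python
def forward(a, b, c):
    """Forward pass through the computation graph J(a, b, c) = 3(a + b * c)."""
    u = b * c   # first intermediate node
    v = a + u   # second intermediate node
    J = 3 * v   # output node
    return J, u, v

# For example, a = 5, b = 3, c = 2 gives u = 6, v = 11, J = 33.
J, u, v = forward(5, 3, 2)
print(J, u, v)  # 33 11 6
```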

Derivatives with a Computation Graph

In a computation graph, we often want to calculate how changes in the inputs affect the output. This is where the chain rule is key. It states that for two functions f and g, the derivative of their composition (f \circ g)(x) = f(g(x)) with respect to x is:

(f \circ g)'(x) = f'(g(x)) \cdot g'(x)

In terms of variables, if y = f(u) and u = g(x), then the derivative of y with respect to x is given by:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

Here, \frac{dy}{du} is evaluated at the point u = g(x).
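For a quick concrete example (not tied to the graph above): if y = f(u) = 3u and u = g(x) = x^2, then

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 3 \cdot 2x = 6x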

In the context of a computation graph, we calculate derivatives from the output back to the inputs. This right-to-left pass, which is more natural for computing derivatives, is known as backpropagation in machine learning. The notation d\text{var} represents the derivative of the final output with respect to various intermediate quantities.

Backpropagation Example

Take the visual representation of the computation graph with the forward pass above. For the backward pass, we want to compute the derivatives of the final output J with respect to the inputs a, b, c, and along the way we will need them in terms of the intermediate quantities u, v.

  1. Compute the intermediate derivatives dJ/dv and dJ/du, which are the derivatives of the final output J with respect to v and u.
\frac{dJ}{dv} = 3, \quad \frac{dJ}{du} = \frac{dJ}{dv} \cdot \frac{dv}{du} = 3 \cdot 1 = 3
  2. Compute dJ/da, which is the derivative of the final output J with respect to a.
\frac{dJ}{da} = \frac{dJ}{dv} \cdot \frac{dv}{da} = 3 \cdot 1 = 3
  3. Compute dJ/db, which is the derivative of the final output J with respect to b.
\frac{dJ}{db} = \frac{dJ}{du} \cdot \frac{du}{db} = 3 \cdot c = 3c
  4. Compute dJ/dc, which is the derivative of the final output J with respect to c.
\frac{dJ}{dc} = \frac{dJ}{du} \cdot \frac{du}{dc} = 3 \cdot b = 3b

We compute these derivatives going from right to left, applying the chain rule at each step. These gradients tell us how much a change in a certain variable affects the final output. This is crucial for algorithms like gradient descent, where we need to adjust parameters to minimize a cost function.
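The same right-to-left pass can be written out as a short sketch in plain Python (function name and input values are illustrative):

```python
def backward(a, b, c):
    """Backward pass for J(a, b, c) = 3(a + b * c), applying the chain rule right to left."""
    # Forward pass, so the intermediate values are available
    u = b * c
    v = a + u
    J = 3 * v

    # Backward pass
    dJ_dv = 3            # J = 3v
    dJ_du = dJ_dv * 1    # v = a + u  ->  dv/du = 1
    dJ_da = dJ_dv * 1    # v = a + u  ->  dv/da = 1
    dJ_db = dJ_du * c    # u = b * c  ->  du/db = c
    dJ_dc = dJ_du * b    # u = b * c  ->  du/dc = b
    return dJ_da, dJ_db, dJ_dc

print(backward(5, 3, 2))  # (3, 6, 9), i.e. (3, 3c, 3b)
```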

Logistic Regression Gradient Descent

Similarly, take the derivatives for the gradient descent example with one sample and two features:

Computation Graph Logistic Regression

  • z is the linear combination of inputs and weights plus the bias.
  • a is the activation computed using the sigmoid function \sigma(z).
  • \mathcal{L}(a, y) is the loss function, which is the binary cross-entropy.
  • The derivatives \frac{d\mathcal{L}}{dw_1}, \frac{d\mathcal{L}}{dw_2}, and \frac{d\mathcal{L}}{db} describe how the loss function changes with respect to each parameter.
  • The update rules define how to adjust the parameters w_1, w_2, and b in the direction that minimizes the loss, using a learning rate \alpha.

Now for the derivatives of the backward pass. First, the derivative of the binary cross-entropy loss function:

\frac{d\mathcal{L}(a, y)}{da} = -\left(\frac{y}{a} - \frac{1 - y}{1 - a}\right)

For the derivative of the sigmoid function:

\frac{da}{dz} = a \cdot (1 - a)

For z we have:

\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{da} \cdot \frac{da}{dz} = -\left(\frac{y}{a} - \frac{1 - y}{1 - a}\right) \cdot a(1 - a) = a - y

Now for the partial derivatives with respect to weights and bias:

\frac{d\mathcal{L}}{dw_1} = x_1 \cdot \frac{d\mathcal{L}}{dz}, \quad \frac{d\mathcal{L}}{dw_2} = x_2 \cdot \frac{d\mathcal{L}}{dz}, \quad \frac{d\mathcal{L}}{db} = \frac{d\mathcal{L}}{dz}

And now the update rules for the weights and bias:

w_1 := w_1 - \alpha \frac{d\mathcal{L}}{dw_1}, \quad w_2 := w_2 - \alpha \frac{d\mathcal{L}}{dw_2}, \quad b := b - \alpha \frac{d\mathcal{L}}{db}
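Putting the forward pass, backward pass, and update rules together, one gradient descent step for a single example might look like the following sketch (assuming NumPy; the function name and learning rate value are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(x1, x2, y, w1, w2, b, alpha=0.1):
    """One gradient descent step for a single example with two features."""
    # Forward pass
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)

    # Backward pass
    dz = a - y        # dL/dz
    dw1 = x1 * dz     # dL/dw1
    dw2 = x2 * dz     # dL/dw2
    db = dz           # dL/db

    # Update rules
    w1 = w1 - alpha * dw1
    w2 = w2 - alpha * dw2
    b = b - alpha * db
    return w1, w2, b
```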

Gradient Descent on m Examples

Take the cost function J for logistic regression, which is the average of the binary cross-entropy loss over all m training examples:

\begin{align*}
J(w, b) &= \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\
        &= -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right)
\end{align*}

We can also calculate the average gradient across all examples. The derivative of J with respect to w_j is:

\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_j} \mathcal{L}(a^{(i)}, y^{(i)})

Once we've computed the derivatives of the cost function J with respect to each parameter, we update the parameters w and b to minimize J.
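A vectorized sketch of one such step over all m examples (assuming NumPy, with X of shape (m, n_features) and y of shape (m,); the function name and learning rate value are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, y, w, b, alpha=0.1):
    """One gradient descent step over m examples, using the averaged gradients."""
    m = X.shape[0]

    # Forward pass on all examples at once
    z = X @ w + b              # shape (m,)
    a = sigmoid(z)

    # Cost: average binary cross-entropy
    J = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

    # Gradients averaged over the m examples
    dz = a - y                 # shape (m,)
    dw = (X.T @ dz) / m        # shape (n_features,)
    db = np.mean(dz)

    # Update parameters
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, J
```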