Neural Style Transfer

What is neural style transfer?

Neural Style Transfer (NST) is a technique that applies the stylistic appearance of one image, known as the style image, to the content of another, known as the content image. Unlike typical training processes in machine learning, NST doesn't learn parameters but instead optimizes pixel values directly to produce artistic effects.

Artistic rendering of NST

Implementing NST requires examining both shallow and deep features extracted by a convolutional neural network (ConvNet). This process often utilizes a form of transfer learning, where a ConvNet such as VGG, pre-trained on a large dataset, is repurposed to merge the content and style of two separate images.

What Do Deep ConvNets Learn?

First, let's visualize what a deep network learns, using an AlexNet-like ConvNet:

Pick a unit in layer $l$ and find the nine image patches that maximize that unit's activation. Notice that a hidden unit in layer 1 sees only a relatively small portion of the input image; if you plot the patches, units in shallower layers match small sections of the input image, while units in deeper layers see larger sections. Repeat this process for other units and layers.

Once plotted, it turns out that layer 1 (and other shallow layers) learns low-level representations of the image, such as colors and edges. As you go deeper, each layer learns progressively more complex representations, since a hidden unit in a deeper layer sees a larger region of the image.

Layer-wise feature visualization

The first layer's visualization is created directly from its weights. The other images are generated from the receptive field in the input image that maximally activated each neuron. Units in deeper layers actually see larger image patches, but the patches are plotted at the same size above for easier comparison.

Zeiler and Fergus, 2013, Visualizing and Understanding Convolutional Networks

A good explanation of how to compute the receptive field at a given layer:

Receptive field illustration

This is taken from A guide to receptive field arithmetic for Convolutional Neural Networks.
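
To make the arithmetic concrete, here is a minimal sketch in Python following the recurrence described in that guide; the function name and the layer list format are illustrative assumptions, not code from the guide:

```python
# Minimal sketch of receptive-field arithmetic: for each conv/pool layer
# with kernel size k and stride s, the receptive field r grows by
# (k - 1) * j, where j is the cumulative stride ("jump") so far.
def receptive_field(layers):
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j  # field grows by (k-1) input-pixel jumps
        j *= s            # strides compound multiplicatively
    return r

# e.g., three 3x3 convolutions with stride 1 -> 7x7 receptive field
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```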

Cost Function

We will define a cost function for the generated image that measures how good it is, so that minimizing it generates the image we want. Given a content image $C$, a style image $S$, and a generated image $G$:

$$J(G) = \alpha \cdot J_{\text{content}}(C, G) + \beta \cdot J_{\text{style}}(S, G)$$
  • $J_{\text{content}}(C, G)$: Measures how similar the generated image is to the content image.
  • $J_{\text{style}}(S, G)$: Measures how similar the generated image is to the style image.
  • $\alpha$, $\beta$: Hyperparameters adjusting the relative importance of content and style.

These functions measure how similar the content of the generated image $G$ is to the content of the content image $C$, and how similar the style of $G$ is to the style of $S$. Finally, we weight them with two hyperparameters $\alpha$ and $\beta$ to specify the relative weighting between the content cost and the style cost.

Find the generated image $G$:

  • Initialize $G$ randomly
    • For example, $G$: $100 \times 100 \times 3$
  • Use gradient descent to minimize $J(G)$
    • $G := G - dG$
    • We compute the gradient of the cost with respect to the image and use gradient descent to minimize the cost function.
    • Essentially, we update the pixel values.

The iterations that generate an image from these two inputs may look as follows:

NST input-output transformation

NST iterative process

Here the first iteration is randomly initialized. A minimal sketch of this optimization loop is shown below.
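
To make the loop concrete, here is a minimal sketch in PyTorch. `J_content` and `J_style` are hypothetical placeholders for the cost functions defined in the next sections (each would run both images through a pre-trained ConvNet such as VGG and compare activations); Adam is used in place of plain gradient descent, as is common in practice.

```python
# Minimal sketch of the NST optimization loop (PyTorch).
# J_content and J_style are hypothetical helpers standing in for the
# content and style costs defined in the sections below.
import torch

def style_transfer(C, S, num_steps=1000, alpha=10.0, beta=40.0, lr=0.01):
    # Initialize G randomly, with the same shape as the content image
    G = torch.randn_like(C, requires_grad=True)
    optimizer = torch.optim.Adam([G], lr=lr)

    for _ in range(num_steps):
        optimizer.zero_grad()
        J = alpha * J_content(C, G) + beta * J_style(S, G)
        J.backward()       # compute the gradient of J with respect to the pixels of G
        optimizer.step()   # update the pixel values
    return G.detach()
```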

The Content Cost Function

In the previous section we showed that we need cost functions for the content and the style that measure how similar the generated image is to each of them. Say you use hidden layer $l$ to compute the content cost. If we choose $l$ to be shallow (like layer 1, for example), we force the network to produce an output very similar to the original content image. In practice, $l$ is chosen to be neither too shallow nor too deep; it is somewhere in the middle.

We use a pre-trained ConvNet for this (e.g., the VGG network). Let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations from layer $l$ on the content and generated images, respectively, from the layer we have chosen to compute the content cost. If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, then the two images have similar content.

$$J_{\text{content}}(C, G) = \frac{1}{2} \left\| a^{[l](C)} - a^{[l](G)} \right\|^2$$

The $\frac{1}{2}$ here is a normalization constant; its exact value doesn't matter much, since it can be absorbed into $\alpha$. We then take the element-wise sum of squared differences between the layer-$l$ activations of $C$ and $G$.
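
As a small illustration, here is this cost in PyTorch, assuming `a_C` and `a_G` are the layer-$l$ activation tensors already extracted from the pre-trained network:

```python
import torch

def content_cost(a_C, a_G):
    # (1/2) * element-wise sum of squared differences between activations
    return 0.5 * torch.sum((a_C - a_G) ** 2)
```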

Style Cost Function

What exactly is the meaning of the 'style' of an image? Say you are using layer $l$'s activations to measure style. We define style as the correlation between activations across channels. That means that, given an activation volume like this with 5 channels:

Block of activations

We try to compute, for example, how correlated the orange channel is with the yellow channel. Correlated here means that when a value appears in one channel, a corresponding value tends to appear in the other (they depend on each other). Uncorrelated means that a value appearing in one channel tells you nothing about what appears in the other (they do not depend on each other).

The correlation tells you which of these high-level texture components tend to occur (or not occur) together in a part of the image. The correlations seen across the style image's channels should also appear across the generated image's channels. To capture this, we use the style matrix (Gram matrix).

In the deeper layers of a ConvNet, each channel corresponds to a different feature detector. The style matrix $G^{[l]}$ measures the degree to which the activations of different feature detectors in layer $l$ vary (or correlate) together with each other. It can be seen as a matrix of cross-correlations between the different feature detectors.

Let $a_{ijk}^{[l]}$ be the activation at layer $l$ at position $(i, j, k)$, where $i$ indexes the height, $j$ the width, and $k$ the channel. The matrices $G^{[l](S)}$ and $G^{[l](G)}$ both have shape $(n_C^{[l]}, n_C^{[l]})$. We call this matrix the style matrix or Gram matrix; each cell tells us how correlated one channel is with another.

To populate the matrices, we use these equations to compute the style matrix of the style image and of the generated image.

$$G_{kk'}^{[l](S)} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a_{ijk}^{[l](S)} \cdot a_{ijk'}^{[l](S)}$$

$$G_{kk'}^{[l](G)} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a_{ijk}^{[l](G)} \cdot a_{ijk'}^{[l](G)}$$

The capital $G$ stands for 'Gram matrix' or 'style matrix'. Here $k$ and $k'$ range from 1 to $n_C^{[l]}$, the number of channels in layer $l$; $i$ and $j$ index the height and width. As the equations show, each entry of $G$ is the sum, over all spatial positions, of the product of the two channels' activations.

To compute the Gram matrix efficiently (see the sketch after these steps):

  1. Reshape the activation from $n_H \times n_W \times n_C$ to $n_{HW} \times n_C$.
  2. Name the reshaped activation $F$.
  3. $G^{[l]} = F^T F$
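
A minimal sketch of these three steps in PyTorch, assuming `a` is a layer-$l$ activation tensor of shape $(n_H, n_W, n_C)$:

```python
import torch

def gram_matrix(a):
    n_H, n_W, n_C = a.shape
    F = a.reshape(n_H * n_W, n_C)  # steps 1-2: flatten spatial dimensions into F
    return F.T @ F                 # step 3: (n_C, n_C) channel correlation matrix
```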

Finally, the per-layer style cost is the squared Frobenius norm of the difference between the two Gram matrices, with a normalization constant:

$$\begin{align*} J_{\text{style}}^{[l]}(S, G) &= \frac{1}{\left(2 n_H^{[l]} n_W^{[l]} n_C^{[l]}\right)^2} \left\| G^{[l](S)} - G^{[l](G)} \right\|^2_F \\ &= \frac{1}{\left(2 n_H^{[l]} n_W^{[l]} n_C^{[l]}\right)^2} \sum_{k} \sum_{k'} \left( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)} \right)^2 \end{align*}$$
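
Continuing the sketch, the per-layer style cost can be written as follows, assuming `a_S` and `a_G` are the layer-$l$ activations of the style and generated images and `gram_matrix` is the helper above:

```python
import torch

def style_cost_layer(a_S, a_G):
    n_H, n_W, n_C = a_S.shape
    GS, GG = gram_matrix(a_S), gram_matrix(a_G)
    norm = (2 * n_H * n_W * n_C) ** 2  # normalization constant from the formula
    return torch.sum((GS - GG) ** 2) / norm
```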

It turns out that you get more visually pleasing results if you use the style cost function from multiple different layers. If you have used a set of layers, each weighted by a hyperparameter $\lambda^{[l]}$, the overall style cost is:

$$J_{\text{style}}(S, G) = \sum_{l} \lambda^{[l]} J_{\text{style}}^{[l]}(S, G)$$
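
A minimal sketch of this weighted sum, assuming a hypothetical `activations` dict mapping each chosen layer to its (style, generated) activation pair, and `lambdas` holding the weights $\lambda^{[l]}$:

```python
# Combine per-layer style costs with weights lambda^[l]; `activations` and
# `lambdas` are hypothetical structures, style_cost_layer is sketched above.
def style_cost(activations, lambdas):
    return sum(lambdas[l] * style_cost_layer(a_S, a_G)
               for l, (a_S, a_G) in activations.items())
```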