MobileNet
MobileNet is a class of efficient models for mobile and edge devices. It addresses the need for less computationally intensive neural networks that maintain high accuracy while being able to run on devices with limited processing capabilities, like smartphones.
Depthwise Separable Convolution
The key innovation in MobileNet is the use of depthwise separable convolutions. This technique significantly reduces the computational cost without dramatically sacrificing model performance.
Normal Convolution
In standard convolution, an input volume of size 6x6x3 is convolved with a 3x3x3 filter to produce an output volume; with 5 such filters (stride 1, no padding), the output is 4x4x5. The computational cost for this operation is quite high due to the large number of multiplications involved.
The computation cost here is calculated as: #filter parameters x #filter positions x #filters.
For this case, it is (3x3x3) x (4x4) x (5) = 2,160.
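As a quick check of this arithmetic, here is a minimal Python sketch; the helper name `conv_cost` is ours, not from the paper.

```python
def conv_cost(f, n_c, n_out, n_filters):
    """Multiplications in a standard convolution:
    (#filter parameters) x (#filter positions) x (#filters)."""
    filter_params = f * f * n_c        # each filter is f x f x n_c
    filter_positions = n_out * n_out   # positions where the filter is applied
    return filter_params * filter_positions * n_filters

# 6x6x3 input, five 3x3x3 filters, 4x4x5 output
print(conv_cost(f=3, n_c=3, n_out=4, n_filters=5))  # 2160
```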
Depthwise Separable Convolution
Depthwise separable convolution splits the convolution process into two layers:
- Depthwise Convolution: Applies a single filter per input channel.
The filter in the depthwise convolution is going to be 3x3 instead of 3x3x3. The number of filters is going to be n_C, the number of input channels, which in this case is 3. The way you compute the output is to apply each of these 3x3 filters to its corresponding input channel.
The computation cost here is again #filter parameters x #filter positions x #filters.
The total computational cost is (3x3) x (4x4) x (3) = 432. The next step is to take this 4x4x3 intermediate volume and carry out one more pointwise step in order to get the 4x4x5 output we want.
- Pointwise Convolution (1x1 Convolution): Changes the depth dimension by applying a 1x1 convolution.
We take the intermediate volume, which is now 4x4x3, and convolve it with a filter that is 1x1x3 in this case. Doing this with just one filter gives a 4x4x1 output; in order to get a 4x4xn_C' output, you actually do this with n_C' filters, which in this case we will set to 5.
The computation cost here is again #filter parameters x #filter positions x #filters.
For every one of these output values, we had to apply one 1x1x3 filter to part of the input. That costs three multiplications, or (1x1x3), which is the number of filter parameters. The filter had to be placed in (4x4) different positions, and we had five filters, i.e. (1x1x3) x (4x4) x (5) = 240.
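To make the two steps concrete, here is a small NumPy sketch of this exact 6x6x3 example (stride 1, no padding; the variable names are ours). The depthwise loop performs (3x3) x (4x4) x 3 = 432 multiplications and the pointwise step (1x1x3) x (4x4) x 5 = 240, matching the counts above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 3))    # 6x6x3 input volume
dw = rng.standard_normal((3, 3, 3))   # one 3x3 filter per input channel
pw = rng.standard_normal((3, 5))      # five 1x1x3 pointwise filters

# Depthwise step: each 3x3 filter convolves only its own input channel.
inter = np.zeros((4, 4, 3))           # 4x4x3 intermediate volume
for c in range(3):
    for i in range(4):
        for j in range(4):
            inter[i, j, c] = np.sum(x[i:i+3, j:j+3, c] * dw[:, :, c])

# Pointwise step: a 1x1x3 filter mixes the 3 channels at every position.
out = inter @ pw                      # (4, 4, 3) @ (3, 5) -> (4, 4, 5)
print(inter.shape, out.shape)         # (4, 4, 3) (4, 4, 5)
```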
Overall, with these two steps, we ended up with a 4x4x5 output, but the cost has been reduced from 2,160 to (432 + 240) = 672 for the depthwise separable convolution. If we look at the ratio between these two numbers, 672 over 2,160, it turns out to be about 0.31. The authors of the MobileNets paper showed that, in general, the ratio of the cost of the depthwise separable convolution to the cost of the normal convolution is equal to:
1/n_C' + 1/f²
where n_C' is the number of output channels and f is the filter size. In our case this was 1/5 + 1/3², which is about 0.31. This approach significantly reduces the computational cost while maintaining the same output dimensionality as the normal convolution.
Computational Cost Analysis
In a more typical neural network example, n_C' will be much bigger. It may be, say, 512 channels in your output, with a 3x3 filter, which would be fairly typical parameters for a neural network. The ratio is then 1/512 + 1/9 ≈ 0.113, so very roughly, the depthwise separable convolution may be about 10 times cheaper in computational cost.
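Plugging numbers into this ratio, a tiny sketch (the helper name `cost_ratio` is ours):

```python
def cost_ratio(f, n_c_prime):
    """Depthwise separable cost divided by standard convolution cost."""
    return 1 / n_c_prime + 1 / f**2

print(cost_ratio(f=3, n_c_prime=5))    # ~0.311, i.e. 672 / 2160 in our example
print(cost_ratio(f=3, n_c_prime=512))  # ~0.113, roughly a 10x saving
```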
This is why the depthwise separable convolution as a building block of a ConvNet allows you to carry out convolutions much more efficiently than using a normal convolution.
Note that in order to make the diagrams look simpler, even when the number of channels is greater than three, we will still draw the depthwise convolution operation as if it were a stack of three filters. When you see this icon later, think of it as the symbol we use to denote a depthwise convolution, rather than a literal visualization of the number of channels in the depthwise convolution filter. In this example, the input would be 6x6xn_C.
Overall, we have learned about the depthwise separable convolution, which comprises two main steps: the depthwise convolution and the pointwise convolution. This operation can be designed to have the same input and output dimensions as the normal convolutional operation, but at a much lower computational cost. Let's now take this building block and use it to build MobileNet.
Building MobileNet with Depthwise Separable Convolutions
MobileNet architecture leverages depthwise separable convolutions to construct an efficient and performant network suitable for mobile deployment. It replaces expensive convolutional operations with this more efficient convolutional operation, comprising the depthwise convolution operation and the pointwise convolution operation, using it repetitively to build the network.
MobileNet V1 Architecture
The MobileNet v1 paper had a specific architecture in which a block like this was used 13 times. Each block uses a depthwise separable convolution to generate its outputs, and the stack of 13 of these layers goes from the original raw input image toward a classification prediction. The stack is followed by a standard pooling layer, then a fully connected layer, then a softmax, in order to make the classification prediction.
This turns out to perform well while being much less computationally expensive than earlier algorithms that used a normal convolutional operation.
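As an illustration, one of these repeated blocks can be sketched in Keras as a depthwise convolution followed by a 1x1 pointwise convolution. The batch norm and ReLU placement follows the common v1 pattern, but treat this as an illustrative sketch rather than the exact published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    """One MobileNet v1-style block: depthwise conv, then 1x1 pointwise conv."""
    x = layers.DepthwiseConv2D(kernel_size=3, strides=stride,
                               padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, kernel_size=1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same")(inputs)  # initial stem
x = depthwise_separable_block(x, pointwise_filters=64)       # repeated 13x in v1
x = layers.GlobalAveragePooling2D()(x)                       # pooling layer
outputs = layers.Dense(1000, activation="softmax")(x)        # FC + softmax
model = tf.keras.Model(inputs, outputs)
```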
MobileNet V2 Architecture
MobileNet V2 introduces two significant improvements:
- Residual Connections: Similar to ResNets, these connections allow gradients to propagate backward more efficiently.
- Expansion Layers: These layers increase the number of channels temporarily, allowing the network to capture more complex features before projecting back to a lower dimension.
Bottleneck Blocks in MobileNet V2
The bottleneck block allows the network to perform more complex computations without a proportional increase in memory requirements.
Here is the MobileNet v2 bottleneck block. Given an input that is, say, n x n x 3, the MobileNet v2 bottleneck will pass that input via the residual connection directly to the output, just like in a ResNet.
In the main non-residual part of the block, you first apply an expansion operator, which means applying a fairly large number of 1x1x3 filters, say 18 of them, so that you end up with an n x n x 18 dimensional block. An expansion factor of six is quite typical in MobileNet v2, which is why your input goes from n x n x 3 to n x n x 18, and that's why we call it an expansion: it increases the dimension by a factor of six.
The next step is a depthwise convolution. With a little bit of padding, you can go from n x n x 18 to the same n x n x 18 dimension. In the earlier example, we went from 6x6x3 to 4x4x3 because we didn't use padding; with padding, the dimension is maintained, so it doesn't shrink when you apply the depthwise convolution.
Finally, you apply a pointwise convolution, which in this case means convolving with a 1x1x18 filter. If you have, say, 3 such filters, then you end up with an output that is n x n x 3. In this last step, we went from n x n x 18 down to n x n x 3, and in the MobileNet v2 bottleneck block this last step is also called a projection step because you're projecting down from 18 channels to 3.
By using the expansion operation, you increase the size of the representation within the bottleneck block, which allows the neural network to learn a richer function; there's simply more computation in the middle two steps. But when deploying on a mobile or edge device, you will often have tight memory constraints. The bottleneck block therefore uses the pointwise convolution, or projection operation, to project the values back down to a smaller set, so that when you pass them to the next block, the amount of memory needed to store them is reduced.
The clever idea about the bottleneck block is that it enables a richer set of computations, allowing your neural network to learn richer and more complex functions, while also keeping the amount of memory, that is, the size of the activations you need to pass from layer to layer, relatively small. That's why MobileNet v2 can achieve better performance than MobileNet v1 while still using only a modest amount of compute and memory resources.
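Below is a minimal Keras sketch of such a bottleneck block, assuming the expansion factor of six described above; the ReLU6 activations and the linear (activation-free) projection follow the v2 paper, but the exact layer hyperparameters here are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, out_channels, expansion=6, stride=1):
    """MobileNet v2-style bottleneck: expand (1x1), depthwise (3x3), project (1x1)."""
    in_channels = x.shape[-1]
    shortcut = x
    # Expansion: 1x1 convolution raises the channel count by the expansion factor.
    x = layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    # Depthwise 3x3 convolution; "same" padding preserves the spatial size.
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    # Projection: 1x1 convolution back down; kept linear (no ReLU) in the paper.
    x = layers.Conv2D(out_channels, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    # Residual connection, applicable when input and output shapes match.
    if stride == 1 and in_channels == out_channels:
        x = layers.Add()([shortcut, x])
    return x

inputs = tf.keras.Input(shape=(56, 56, 24))
outputs = bottleneck_block(inputs, out_channels=24)  # 56x56x24 in and out
```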
Conclusion
MobileNets represent a pivotal development in the deployment of neural networks on mobile devices. By utilizing depthwise separable convolutions and innovative architectural choices, they offer a balance between efficiency and performance, making advanced machine learning applications more accessible on devices with limited hardware capabilities.
Szegedy et al., 2014, "Going Deeper with Convolutions"; Howard et al., 2017, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"; Sandler et al., 2018, "MobileNetV2: Inverted Residuals and Linear Bottlenecks".