
Multi-task Learning

In transfer learning, you have a sequential process: you learn from task A and then transfer that knowledge to task B. In multi-task learning, you start off simultaneously, trying to have one neural network do several things at the same time, and the hope is that each of these tasks helps all of the others.

For example, let's say you want to create an object recognition system that can identify different things on the road, like pedestrians, cars, stop signs, and traffic lights. In multi-task learning, you'd set up the system to recognize all these things simultaneously. The label matrix $Y$ will have shape $(4, m)$ because we have 4 classes. Then:

$$\text{Cost} = \frac{1}{m} \sum_{i=1}^{m}\sum_{j=1}^{4} L(\hat{y}^{(i)}_j, y^{(i)}_j)$$

Where:

$$L = -y^{(i)}_j \log(\hat{y}^{(i)}_j) - (1 - y^{(i)}_j) \log(1 - \hat{y}^{(i)}_j)$$
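
To make this concrete, here is a minimal NumPy sketch of that cost. The function name and the assumption that `Y_hat` holds per-task sigmoid outputs of shape $(4, m)$ are illustrative choices, not something specified in these notes:

```python
import numpy as np

def multitask_cost(Y_hat, Y):
    """Average multi-task binary cross-entropy over m examples.

    Y_hat, Y: arrays of shape (4, m) -- one sigmoid output / 0-1 label per task.
    """
    eps = 1e-12                               # avoid log(0)
    L = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    # Sum the per-task losses (axis 0), then average over the m examples.
    return L.sum(axis=0).mean()
```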

In the last example you could have trained 4 neural networks separately, but if some of the earlier features in the neural network can be shared between these different types of objects, then you find that training one neural network to do four things gives better performance than training 4 completely separate neural networks for the four tasks. Multi-task learning will also work if $Y$ isn't complete for some labels. For example:

$$Y = \begin{bmatrix} 1 & ? & 1 & \dots \\ 0 & 0 & 1 & \dots \\ ? & 1 & ? & \dots \end{bmatrix}$$

You can still train your learning algorithm to do the four tasks at the same time, even when some images have only a subset of the labels. In the sum over $j$ from 1 to 4, you sum only over the values of $j$ with a 0 or 1 label. Note that this is NOT softmax: softmax would be a good choice if one and only one of the possibilities (stop sign, speed bump, pedestrian crossing, green light, red light) were present in each image.

In this case the algorithm still does well with the missing data; only the loss function is different:

$$\text{Loss} = \frac{1}{m} \sum_{i=1}^{m} \sum_{\text{all } j \text{ where } y^{(i)}_j \neq\, ?} L(\hat{y}^{(i)}_j, y^{(i)}_j)$$
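
As a sketch of how that restricted sum can be computed, here is one possible NumPy version; the choice of encoding the "?" entries as `np.nan` is an assumption made for this example, not part of the original notes:

```python
import numpy as np

def multitask_cost_missing(Y_hat, Y):
    """Multi-task cross-entropy that skips missing labels (encoded as np.nan).

    Y_hat, Y: arrays of shape (4, m); entries of Y that are np.nan ("?") are
    excluded from the sum, matching the restricted sum over j above.
    """
    eps = 1e-12
    mask = ~np.isnan(Y)                       # True where a 0/1 label exists
    Y0 = np.where(mask, Y, 0.0)               # placeholder for missing entries
    L = -(Y0 * np.log(Y_hat + eps) + (1 - Y0) * np.log(1 - Y_hat + eps))
    # Keep only the labeled entries, then average over the m examples.
    return (L * mask).sum() / Y.shape[1]
```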

Multi-task learning makes sense when:

  1. Training on a set of tasks that could benefit from having shared lower-level features.
  2. Usually, the amount of data you have for each task is quite similar.
    • If you trained each task individually, you might have insufficient data for a given task, whereas training on all tasks simultaneously lets the other tasks' data help.
  3. The network is large enough to learn everything well.

If you can train a big enough neural network, multi-task learning is usually more effective than training separate networks for each task. Although transfer learning is more popular these days, multi-task learning still has its place when you're trying to learn more than one thing at a time.
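
To illustrate the shared-features idea, here is a minimal PyTorch sketch of one possible architecture: a shared trunk whose early layers serve all tasks, followed by a single 4-unit output layer (one logit per task). The layer sizes, names, and use of PyTorch are illustrative assumptions, not something prescribed by these notes:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk with one output per task (hard parameter sharing)."""

    def __init__(self, n_features=1024, n_tasks=4):
        super().__init__()
        # Early layers are shared across all four tasks.
        self.trunk = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        # One logit per task; the sigmoid is applied inside the loss.
        self.head = nn.Linear(64, n_tasks)

    def forward(self, x):
        return self.head(self.trunk(x))

# BCEWithLogitsLoss applies the per-task loss L above to each of the 4 outputs.
model = MultiTaskNet()
criterion = nn.BCEWithLogitsLoss()
x = torch.randn(8, 1024)                     # a batch of 8 illustrative feature vectors
y = torch.randint(0, 2, (8, 4)).float()      # 0/1 labels for the 4 tasks
loss = criterion(model(x), y)
```

Because the trunk parameters are updated by the gradients of all four tasks at once, each task effectively gets extra training signal from the others, which is the benefit described above.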