March 2020
tl;dr: Answers the question: which tasks are better trained together?
Overall impression
The paper uses the dataset from Taskonomy (CVPR 2018 best paper), which studies task transferability. Task grouping, the subject of this paper, studies multi-task learnability. The paper finds that the two correlate inversely.
The goal of multi-task learning is two-fold:
- find the best performance per task (with the regularization power from training with other tasks)
- reduce inference time
One of the key insights from the paper is:
The inclusion of an additional task in a network can potentially improve the accuracy that can be achieved on the existing tasks, even though the performance of the added task might be poor.
Key ideas
- Optimal grouping is better than a single multi-task network or multiple single-task networks.
  - For example, the best strategy found by this paper trains 2.5 networks: two full-sized networks with two tasks each, and a third half-sized network for the fifth task. However, the fifth task is still needed in the two full-sized networks as a regularizer to reach optimal performance on the first four tasks.
- Given enough computational resources, training individual single-task networks is better, but some tasks still need other tasks to help with regularization.
- Task grouping (multi-task learnability) is inversely correlated with task transferability, so it is better to train dissimilar tasks together. –> This is somewhat counter-intuitive. The authors argue that dissimilar tasks provide more meaningful regularization.
- The paper proposes two methods to reduce the computational burden of searching over task groupings (see the sketches after this list):
  - Early stopping: the validation score at 0.2 epochs already correlates well with the final score. This saves about 20x in computation.
  - Higher-order approximation: train all single-task and two-task models and use them to approximate the performance of higher-order groupings. This reduces the number of networks to train from exponential in the number of tasks to quadratic.
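A tiny sketch of how the early-stopping proxy can be sanity-checked: rank-correlate validation losses measured at 0.2 epochs against final losses across candidate groupings. The numbers below are made up purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical validation losses for five candidate groupings, measured
# at 0.2 epochs and again at convergence (numbers are made up).
early = [0.45, 0.58, 0.50, 0.62, 0.41]
final = [0.40, 0.55, 0.47, 0.60, 0.38]

rho, _ = spearmanr(early, final)
print(f"rank correlation, early vs. final: {rho:.2f}")  # high rho -> proxy is usable
```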
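And a minimal Python sketch of one simple variant of the higher-order approximation consistent with the note above: estimate each task's loss inside a candidate group by averaging its losses from the two-task networks pairing it with the other group members, then brute-force over all partitions. The pairwise losses are made up, and the paper's inference-budget constraint is omitted here.

```python
from itertools import combinations

# Hypothetical measured validation losses (lower is better) from fully
# trained one- and two-task networks. Keys are frozensets of task names;
# values map each task in the group to its loss.
pair_loss = {
    frozenset({"seg"}): {"seg": 0.50},
    frozenset({"depth"}): {"depth": 0.60},
    frozenset({"normals"}): {"normals": 0.55},
    frozenset({"seg", "depth"}): {"seg": 0.55, "depth": 0.65},
    frozenset({"seg", "normals"}): {"seg": 0.48, "normals": 0.50},
    frozenset({"depth", "normals"}): {"depth": 0.55, "normals": 0.52},
}

def approx_group_loss(group, task):
    """Estimate task's loss in a multi-task group by averaging its loss in
    all two-task networks pairing it with another group member."""
    others = [t for t in group if t != task]
    if not others:  # single-task network: use the measured value directly
        return pair_loss[frozenset({task})][task]
    pairs = [pair_loss[frozenset({task, o})][task] for o in others]
    return sum(pairs) / len(pairs)

def partitions(tasks):
    """Yield every way to split the task list into non-empty groups."""
    if len(tasks) == 1:
        yield [tasks]
        return
    first, rest = tasks[0], tasks[1:]
    for sub in partitions(rest):
        for i, group in enumerate(sub):
            yield sub[:i] + [[first] + group] + sub[i + 1:]
        yield [[first]] + sub

tasks = ["seg", "depth", "normals"]
best = min(
    partitions(tasks),
    key=lambda p: sum(approx_group_loss(g, t) for g in p for t in g),
)
print("best grouping:", best)
```

With these toy numbers the search picks {depth, normals} plus a separate seg network, echoing the paper's finding that the optimal grouping is often neither one joint network nor all-separate networks.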
Technical details
- Hard parameter sharing: same backbone/encoder, with task-specific heads.
- Soft parameter sharing: same architecture per task, with an L2 distance penalty between corresponding weights, or with added peephole connections between corresponding weights. –> This does not improve inference time.
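A minimal PyTorch sketch of the two sharing schemes. Layer sizes, the toy 1-d heads, and the penalty weight are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

class HardSharedNet(nn.Module):
    """Hard parameter sharing: one shared encoder, one small head per task."""
    def __init__(self, num_tasks, in_dim=128, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_tasks))

    def forward(self, x):
        z = self.encoder(x)  # encoder runs once, reused by every head
        return [head(z) for head in self.heads]

def soft_sharing_penalty(net_a, net_b, weight=1e-3):
    """Soft parameter sharing: L2 distance between corresponding weights of
    two separately kept, identically structured networks."""
    return weight * sum(
        (pa - pb).pow(2).sum()
        for pa, pb in zip(net_a.parameters(), net_b.parameters())
    )

# Usage: two single-task nets trained jointly but kept separate at inference.
net_a, net_b = HardSharedNet(1), HardSharedNet(1)
x = torch.randn(8, 128)
out_a, out_b = net_a(x)[0], net_b(x)[0]
penalty = soft_sharing_penalty(net_a, net_b)  # add to the task losses
```

Hard sharing is what yields the inference-time savings, since the encoder is computed once for all tasks; soft sharing keeps a full network per task at inference, hence the note above.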