CVPR18 Best Paper Speech: Studying the connection between tasks is the right approach to transfer learning
Text | Yang Xiaofan
Report from Leiphone.com (leiphone-sz)
Leifeng.com AI Technology Review: This year's CVPR 2018 best paper, "Taskonomy: Disentangling Task Transfer Learning", takes on a refreshingly novel topic: the relationships between visual tasks, and how those relationships can enable transfer learning across tasks. Compared with the papers we are used to seeing that chase benchmark scores on individual tasks, this work is a breath of fresh air for computer vision.
During CVPR 2018, Leifeng.com AI Technology Review attended the paper's oral presentation as the only registered media outlet covering the event. The speaker was Amir R. Zamir, the paper's first author and a postdoctoral researcher at Stanford and UC Berkeley. As a doctoral student, he was also a co-author of "Structural-RNN: Deep Learning on Spatio-Temporal Graphs" (http://arxiv.org/abs/1511.05298), which won the Best Student Paper Award at CVPR 2016.
The following is the full text of the speech:
Amir R. Zamir: Good morning, everyone. Let me introduce our paper “Taskonomy: Disentangling Task Transfer Learning”, which I co-authored with Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik and Silvio Savarese.
We start with a question: are visual tasks related to one another, or are they all independent? For example, is there any relationship between depth estimation and surface normal prediction, or between object recognition and indoor layout estimation? We believe the answer is yes, whether from intuition or from prior knowledge. For example, we know that a surface normal prediction model, a depth prediction model, or an indoor layout model can substantially help object recognition. So there must be some relationship between tasks.
So what are the impacts of these relationships? What important role do they play? That's what I want to tell you today.
I would like to make four key points:

- There are relationships between tasks;
- These relationships can be derived computationally, without requiring human knowledge;
- Tasks live in a structured space rather than being isolated concepts;
- This structure provides us with a unified model for transfer learning.
The examples I just showed are only a few of many visual tasks. We can ask the same questions about any pair of tasks: are they related, and how strongly? To answer these questions, we need a global view of the relationships and redundancies among tasks. We need to treat tasks as a collection rather than in isolation, and exploit the relationships and redundancies among them to achieve higher efficiency.
One of the most interesting things to improve is supervision efficiency, that is, solving problems with less labeled data; this is the focus of our research. Many papers have discussed how to reduce a model's demand for labeled data, including self-supervised learning, unsupervised learning, meta-learning, task adaptation, and fine-tuning features learned on ImageNet, which has now become common practice.
In fact, transfer learning is possible precisely because of these relationships between tasks. At a high level, if we can transfer or translate the internal state learned by one model, it may help us learn to solve other tasks, provided the two tasks are related in some way. Let me go into this in more detail.
Take surface normal prediction as an example: we train a dedicated neural network to predict the surface normals in an image, and it clearly works well. If we train the network with only 2% of the training data, we get the result in the lower left corner, which, as you would expect, is poor.
Then we transferred models from two other tasks, image reshading and image segmentation, training a small transfer network on top of each using the same 2% of the data.
As you can see, the image segmentation model does not transfer well, but the reshading model performs well on surface normal prediction. This shows that the relationship between reshading and surface normal prediction is stronger and closer than that between segmentation and surface normal prediction. Intuitively, this makes sense: reshading a scene has a lot to do with its surface normals, whereas it is hard to think of much of a relationship between normal prediction and segmentation, and perhaps there really is none. So we observe that, for tasks that are related, only a little extra information is needed for the model of one task to help solve another.
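To make the idea concrete, here is a minimal PyTorch sketch of the transfer setup described above, under simplifying assumptions: the encoder of a pretrained source-task network is frozen, and only a small readout decoder is trained on a small fraction of target-task data. The module names (SourceEncoder, ReadoutDecoder) and architectures are illustrative, not the paper's actual networks.

```python
# Minimal sketch: freeze a pretrained source encoder, train a small readout on ~2% of data.
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):           # stand-in for a pretrained source-task network
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.features(x)

class ReadoutDecoder(nn.Module):          # small transfer network trained on the tiny subset
    def __init__(self, out_channels=3):   # e.g. 3 channels for surface normals
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )
    def forward(self, z):
        return self.head(z)

encoder = SourceEncoder().eval()
for p in encoder.parameters():            # lock the source weights
    p.requires_grad = False

decoder = ReadoutDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

# dummy "2% subset": a handful of (image, target) pairs
images = torch.randn(8, 3, 64, 64)
targets = torch.randn(8, 3, 16, 16)       # target resolution matches the encoder output

for step in range(100):                   # train only the readout decoder
    pred = decoder(encoder(images))
    loss = loss_fn(pred, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design point is that the frozen source representation does almost all the work; the readout only needs to learn a light mapping, which is why a few percent of the labeled data can be enough when the two tasks are closely related.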
If we have some way to quantify the relationships between a large number of arbitrary tasks, we can obtain a complete graph structure. This is the kind of global window we are looking for to understand the redundancy among different tasks. With it, we can exploit the redundancy between tasks mentioned earlier to solve a set of supervised learning tasks, transferring from old tasks to new tasks with very few resources, or to solve an entirely new task for which we have almost no labeled data. Learning to solve a new task then becomes adding something to the existing structure rather than starting from scratch.
That is the purpose of our "Taskonomy": a fully computational method that quantifies the relationships between a large number of tasks, proposes a unified structure over them, and uses that structure as a model for transfer learning. The name "Taskonomy" combines the words "task" and "taxonomy", meaning that we learn a transfer strategy from a taxonomic perspective.
Here's what we did. First, we assembled a set of 26 tasks, including semantic, 2D, 2.5D, and 3D tasks. We won't dwell on how one might choose a broader set of visual tasks; these are just a sample set for our demonstration, and I will come back later to how the task list enters the computation. We collected about 4 million photos of indoor scenes, and each photo was annotated for all 26 tasks. These images are all real, not generated; for the 3D tasks we scanned the corresponding indoor scenes with a structured-light sensor, which made it much easier to obtain ground truth for those tasks.
Next, we trained a task-specific neural network for each of the 26 tasks in the list; the images shown here are the outputs of these 26 networks. The 3D tasks include curvature estimation, the semantic tasks include object classification, and some tasks are self-supervised, such as colorization.
Our approach has four main steps. The first step is to train these 26 task-specific networks and then freeze their weights; at that point, each task has a neural network trained specifically for it.
The next step is to quantify the relationships between tasks. Take the relationship between normal estimation and curvature estimation as an example: we train a small transfer network on top of the frozen normal estimation model, and this small network tries to compute curvature from the normal model's representation. We then evaluate the small network's performance on held-out test data. This performance is the basis for assessing the direct transferability between that specific pair of tasks.
There are 26 x 25 ordered pairs of tasks in the list, and we trained and evaluated all of them. This gives us the complete task relationship graph we want. However, the values on the edges need to be normalized, because the tasks have different output spaces with different mathematical properties: when we assemble the adjacency matrix of the full graph from the raw results, a few entries clearly dominate simply because of those differences in output space, not because the transfers are genuinely better. So we need to normalize.
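As a hedged sketch of how the raw (un-normalized) affinity matrix might be assembled: for every ordered (source, target) pair, a transfer readout is trained and its test loss recorded. The helper train_transfer_and_eval and the four-task list below are hypothetical placeholders, not the paper's code.

```python
# Assemble the raw pairwise transfer-performance matrix (rows: sources, cols: targets).
import numpy as np

tasks = ["normals", "reshading", "curvature", "segmentation"]  # small subset for illustration
n = len(tasks)

def train_transfer_and_eval(source, target):
    # placeholder: would train a readout from `source` features to `target`
    # on a small data subset and return its loss on held-out test data
    return np.random.rand()

raw_affinity = np.full((n, n), np.nan)
for i, src in enumerate(tasks):
    for j, tgt in enumerate(tasks):
        if i == j:
            continue                        # self-transfers are excluded (hence 26 x 25 pairs)
        raw_affinity[i, j] = train_transfer_and_eval(src, tgt)

# Raw losses are not comparable across columns: each target task has its own
# output space and loss scale, hence the normalization step described next.
```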
The way we normalize the matrix is an ordinal method based on the Analytic Hierarchy Process (AHP). I won't go into detail here, but in short, we chose an ordinal approach because, unlike other analytical methods, it does not rely on assumptions about the mathematical properties of the output spaces, which is critical for us. For more details, please refer to our paper.
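The sketch below illustrates one simplified AHP-style ordinal normalization for a single target task, under assumptions of my own (it is not the paper's exact procedure): sources are compared pairwise by how often one yields a lower per-image test loss than another, and the principal eigenvector of the resulting comparison matrix serves as the normalized, scale-free affinity vector.

```python
# Simplified ordinal (AHP-style) normalization for one target task.
import numpy as np

def ordinal_affinities(per_image_losses):
    """per_image_losses: (n_sources, n_test_images) array for a single target task."""
    n = per_image_losses.shape[0]
    W = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wins_i = np.mean(per_image_losses[i] < per_image_losses[j])
            wins_j = 1.0 - wins_i
            # ratio of win rates; clipped to avoid division by zero
            W[i, j] = wins_i / max(wins_j, 1e-6)
    # principal eigenvector of the pairwise-comparison matrix gives scale-free scores
    vals, vecs = np.linalg.eig(W)
    v = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return v / v.sum()

# example: 3 candidate sources, 100 test images
scores = ordinal_affinities(np.random.rand(3, 100))
print(scores)   # higher score = source ranked as more useful for this target
```

Because the comparison uses only which source wins on each test image, never the raw loss values, the result does not depend on the loss scale of any particular output space.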
The complete relationship graph is now fully quantified: for each ordered pair of tasks, its value measures how well one task transfers to the other. It is worth noting that not every transfer between two tasks is useful; many relationships between tasks are weak. But there are also some strong relationships, and some clear patterns.
We want to extract a sparse structure from this complete graph. The extracted structure should maximize performance on the original tasks and tell us how to choose the best source task for a new task, which source tasks can transfer to as many targets as possible, and how to transfer to a new task not included in the task list.
In this step, what we do can be formalized simply as a subgraph selection problem. We have a list of tasks in which previously seen tasks are shown as gray nodes and new tasks as red nodes. We then impose some constraints and solve a Boolean integer program to compute the optimal subgraph. The details can be found in our paper or poster; it is quite simple and straightforward.
The result of subgraph selection gives us the connectivity we need to solve every task, including the new ones, and to maximize each task's performance with limited resources that stay within a user-defined budget. That budget essentially determines how large the set of source tasks can be. This is our goal.
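The paper formulates this selection as a Boolean integer program; the toy sketch below is only a brute-force stand-in for the same idea, assuming a small task set and a budget on how many source tasks can be fully supervised. The function name and the objective (each target simply picks its best available source) are illustrative simplifications.

```python
# Brute-force stand-in for budgeted source-task selection (the paper uses a Boolean integer program).
from itertools import combinations
import numpy as np

def select_sources(affinity, budget):
    """affinity[i, j]: normalized value of transferring source i -> target j.
    budget: maximum number of source tasks we can afford to supervise fully."""
    n_sources, _ = affinity.shape
    best_subset, best_value = None, -np.inf
    for k in range(1, budget + 1):
        for subset in combinations(range(n_sources), k):
            # each target picks its best available source inside the subset
            value = affinity[list(subset), :].max(axis=0).sum()
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value

affinity = np.random.rand(6, 6)            # toy 6-task affinity matrix
print(select_sources(affinity, budget=2))  # which 2 sources best cover all targets
```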
(Leiphone AI Technology Review note: the paper also contains a diagram of the whole pipeline.)
Another point I didn't have time to elaborate on is higher-order transfers: two or more tasks can be used together as source tasks, with their combined resource cost accounted for within our framework. In fact, our adjacency matrix is much larger than 26 x 25 because of these many-to-one combinations.
Here are the numbers. For the list of 26 tasks, there were about 3,000 transfer networks, which took 47,829 GPU-hours to train and cost us about $47,000. Only 1% of the task-specific networks' training data was used to train the transfer networks.
Here is an example taxonomy result. It includes the 26 tasks I mentioned earlier, 4 of which are target tasks, meaning they have very little data: just enough to train the small transfer network, but not enough to train a new network from scratch. Looking at the connectivity of these tasks matches intuition: 3D tasks connect much more strongly to other 3D tasks than to 2D tasks.
To evaluate the effectiveness of our transfer strategy, we propose two metrics, Gain and Quality. Gain is the win rate of the transferred network against a task-specific network trained from scratch on the same small amount of data; the darker the blue in the figure, the higher the win rate, meaning the transfer is consistently better. Quality is the win rate of the transferred network against the task-specific network trained on all the data. In many cases the cells are white, meaning these transferred models already perform as well as the fully supervised task-specific network, our gold standard.
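A hedged sketch of the two win-rate metrics described above, computed from per-image test losses; the names and inputs below are illustrative rather than the paper's code.

```python
# Win-rate metrics: Gain (vs. from-scratch on the same small data) and Quality (vs. fully supervised).
import numpy as np

def win_rate(transfer_losses, reference_losses):
    """Fraction of test images on which the transferred network beats the reference."""
    return float(np.mean(np.asarray(transfer_losses) < np.asarray(reference_losses)))

transfer = np.random.rand(1000)             # per-image losses of the transferred readout
scratch_small = np.random.rand(1000) + 0.2  # same architecture trained from scratch on the small subset
fully_supervised = np.random.rand(1000)     # task-specific network trained on all the data

gain = win_rate(transfer, scratch_small)        # "Gain"
quality = win_rate(transfer, fully_supervised)  # "Quality"
print(gain, quality)
```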
This is just one example taxonomy. You can compute your own, but the easiest way is to try our live online API at taskonomy.vision/api, where you can set the parameters you want and view the qualitative and quantitative results of the task taxonomy. It is also worth mentioning that our taxonomy works well together with ImageNet features, which are the most commonly used features for transfer; we have run some ImageNet-related experiments as well, and you are welcome to read about them in the paper.
Finally, let me summarize:

- We have taken a positive step toward understanding the space of visual tasks;
- We view tasks as a group in a structured space rather than as isolated concepts; the diagram shown is drawn from the quantified relationships;
- The framework is fully computational;
- It helps us do transfer learning and also brings us closer to a generalized perception model.
If you visit our website http://taskonomy.stanford.edu/ you can also watch an introductory video on YouTube. Thank you!
(End of speech)
In fact, the website http://taskonomy.stanford.edu/ offers a wealth of material, including a live demo, an API for custom task-set computation, visualizations of the transfer process, pretrained models, and dataset downloads. As the author said, readers who want to know more can read the original paper and visit the website.
Paper address: http://taskonomy.stanford.edu/taskonomy_CVPR2018.pdf
Reported by Leifeng.com AI Technology Review.