Auxiliary Learning in Transfer Learning
Write a literature review on the topic of auxiliary learning in transfer learning
Introduction: Briefly introduce the topic and state the motivation behind choosing it.
Background: Explain the foundational concepts and theories related to your topic.
Survey of Existing Literature: Review and analyze the key research papers, highlighting their contributions and limitations.
[Important] Synthesis and Critical Analysis: Compare and contrast the approaches, and discuss the common trends and open challenges in the area.
Conclusion: Summarize your findings and suggest possible future research directions.
some possible citations might be:
1. Liu, Shikun, Andrew J. Davison, and Edward Johns. “Self-Supervised Generalisation with Meta Auxiliary Learning.” arXiv.org, November 26, 2019. https://arxiv(dot)org/abs/1901.08933.
2. Jaderberg, Max, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. “Reinforcement Learning with Unsupervised Auxiliary Tasks.” arXiv.org, November 16, 2016. https://arxiv(dot)org/abs/1611.05397.
3. Trinh, Trieu H., Andrew M. Dai, Minh-Thang Luong, and Quoc V. Le. “Learning Longer-Term Dependencies in RNNs with Auxiliary Losses.” arXiv.org, June 13, 2018. https://arxiv(dot)org/abs/1803.00144.
4. Du, Yunshu, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Mehrdad Farajtabar, Razvan Pascanu, and Balaji Lakshminarayanan. “Adapting Auxiliary Losses Using Gradient Similarity.” arXiv.org, November 25, 2020. https://arxiv(dot)org/abs/1812.02224.
5. Dery, Lucio M., Paul Michel, Mikhail Khodak, Graham Neubig, and Ameet Talwalkar. “Aang: Automating Auxiliary Learning.” arXiv.org, May 27, 2022. https://arxiv(dot)org/abs/2205.14082.
6. Dery, Lucio M., Yann Dauphin, and David Grangier. “Auxiliary Task Update Decomposition: The Good, the Bad and the Neutral.” OpenReview, September 28, 2020. https://openreview(dot)net/forum?id=1GTma8HwlYp.
Please get more than 10 references
Auxiliary Learning in Transfer Learning
Introduction
Training state-of-the-art neural networks such as GPT-3, BERT, and RoBERTa requires massive amounts of training data and computational resources. A key challenge that developers face when training such models is the scarcity of data, especially for low-resource tasks. Transfer learning is a machine learning technique that uses knowledge learned from one task, often in the form of auxiliary data, to improve performance on a related task. Its motivation is to reduce the amount of labeled data required to train a model on a new task, and it has shown promising results in domains such as computer vision, natural language processing, and speech recognition (Zhuang et al., 2020). One of the key challenges in transfer learning is how to effectively transfer the knowledge learned from the source task to the target task. Auxiliary learning is a technique that involves training a model on multiple related tasks simultaneously. The goal of auxiliary learning is to improve performance on the primary task by utilizing information learned from the auxiliary tasks.
In auxiliary learning, the model learns a shared representation that can be used across multiple tasks. The shared representation can capture the commonalities between the tasks, and the auxiliary tasks can provide additional information to help the model better understand the primary task. By utilizing auxiliary learning, the model can leverage the knowledge learned from the related tasks to improve its performance on the target task. There are different approaches to incorporating auxiliary learning in transfer learning. One approach is to pre-train the model on a large corpus of data using an unsupervised learning approach (Kalyan et al., 2021). The pre-trained model can then be fine-tuned on the target task using a smaller labeled dataset. Another approach is to train the model on multiple related tasks simultaneously, with the goal of improving the performance of the primary task.
Several studies have shown the effectiveness of auxiliary learning in improving the performance of transfer learning. For example, in natural language processing, pretraining a language model on a large corpus of data using an unsupervised learning approach has been shown to improve performance on various downstream tasks such as sentiment analysis, named entity recognition, and question answering. Similarly, in computer vision, training a model on multiple related tasks such as object detection, image segmentation, and image classification simultaneously improves the performance of the primary task.
Background Information
Several approaches have been found effective for selecting the most appropriate auxiliary task for the main task. First, gradient similarity for auxiliary losses is a relatively new approach that has gained attention in recent machine learning literature, specifically in the context of auxiliary learning. This approach measures the similarity between the gradients of the primary task and the auxiliary tasks, with the goal of improving the performance of transfer learning models. In traditional auxiliary learning approaches, the primary task and the auxiliary tasks are trained simultaneously, and the gradients of all tasks are combined to update the model parameters (Du et al., 2021). However, this can lead to conflicting gradients, where the gradients of the auxiliary tasks pull the model away from the optimal solution for the primary task. To address this issue, gradient similarity for auxiliary losses was proposed so that an auxiliary gradient contributes to the update only when it is aligned with the gradient of the primary task.
Auxiliary modules in deep multi-task learning (dMTL) are additional layers or components added to a neural network architecture that can help improve the performance of multiple tasks simultaneously. These modules can be designed to extract relevant features, reduce noise, or regularize the training of the model across multiple tasks. Auxiliary modules can be added to any neural network architecture, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based models (Thakur et al., 2022). The use of auxiliary modules in dMTL has been shown to be effective in improving the performance of the model on multiple tasks. One popular type of auxiliary module is the auxiliary classifier, which is a small neural network that is trained to classify the input data at an intermediate layer of the main network. The output of the auxiliary classifier is used as an auxiliary loss that is combined with the main loss to train the model. The idea behind the auxiliary classifier is to encourage the model to learn features that are relevant to multiple tasks simultaneously.
Gated Multi-task Networks (GMTNs) are a recent approach in auxiliary learning that has been shown to be effective in improving the performance of transfer learning models. GMTNs are a type of deep multi-task learning (dMTL) model that uses gating mechanisms to selectively focus on the most relevant task at any given time, while also sharing information across all tasks (Xiao, Zhang, & Chen, 2018). The gating mechanism in a GMTN is a set of trainable parameters that determine how much attention the model should pay to each task. These gating parameters are learned during training and are used to control the flow of information through the network. By selectively gating the flow of information, the model can effectively learn to share information across tasks, while also being able to focus on the task that is most relevant at any given time.
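The gating idea described above can be illustrated with a small sketch. This is not the architecture of Xiao, Zhang, and Chen (2018); it assumes, purely for illustration, a per-feature sigmoid gate computed from the auxiliary feature itself, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(h_main, h_aux, gate_weights, gate_bias):
    """Add auxiliary features to main-task features through a learned gate.

    Assumption for this sketch: one scalar gate per feature,
    gate_j = sigmoid(w_j * h_aux_j + b_j), where w_j and b_j would be
    trained together with the rest of the network.
    """
    fused = []
    for hm, ha, w, b in zip(h_main, h_aux, gate_weights, gate_bias):
        gate = sigmoid(w * ha + b)      # close to 0 blocks, close to 1 passes
        fused.append(hm + gate * ha)    # gated auxiliary feature joins the main flow
    return fused


# Usage: the first gate is nearly closed, the second nearly open.
h = gated_fusion([1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [-50.0, 50.0])
```

Because the gate values are produced by trainable parameters, the network can learn which auxiliary features to let through instead of relying on a hand-written rule.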
AUTOSEM (Automatic Task Selection and Mixing) is a recent approach in auxiliary learning that aims to automate the process of selecting and mixing tasks in multi-task learning (MTL) settings. The main idea behind AUTOSEM is to use a reinforcement learning algorithm to dynamically select and mix tasks during the training process, based on the current state of the model (Dery et al., 2022). The AUTOSEM algorithm consists of three main components: a state representation, an action space, and a reward function. The state representation is a set of features that describe the current state of the model, including the current accuracy of each task and the amount of training data available for each task. The action space is a set of possible task-mixing strategies that the algorithm can choose from, such as mixing all tasks equally, focusing on a subset of tasks, or prioritizing tasks based on their current accuracy. The reward function is a measure of how well the model is performing on the tasks and is used to guide the reinforcement learning algorithm toward selecting task-mixing strategies that lead to better overall performance.
Literature Review
Several studies provide empirical evidence on the effectiveness of the approaches proposed for optimizing auxiliary task selection. As indicated above, the most recent approaches to auxiliary task selection are gradient similarity for auxiliary losses, auxiliary modules in dMTL, Gated Multi-task Networks, and AUTOSEM.
Gradient Similarity for Auxiliary Losses
In their paper, Du et al. (2021) explored the use of gradient similarity to guarantee that an auxiliary loss does not harm learning of the main task, by applying the gradient of the auxiliary task only when it is a descent direction for the primary task. Their empirical investigation showed that the method offers an elegant way to determine when an auxiliary task is appropriate. The work found that the cosine similarity of gradients is a generalizable measure that can be applied across domains, and provided evidence that it can be used to detect when a previously useful auxiliary task starts harming the primary task and to block the unwanted interference.
Formally, let Θ denote the parameters of the model, L_p the loss function for the primary task, and L_i the loss function for the i-th auxiliary task, and let ∇L_p(Θ) and ∇L_i(Θ) be the gradients of these loss functions with respect to Θ. The cosine similarity of gradients between the primary and i-th auxiliary tasks is given by:
cos_sim(∇L_p(Θ), ∇L_i(Θ)) = (∇L_p(Θ) ⋅ ∇L_i(Θ)) / (||∇L_p(Θ)|| ⋅ ||∇L_i(Θ)||)
where ⋅ denotes the dot product and ||.|| denotes the L2 norm. The cosine similarity ranges between -1 and 1: a value of 1 indicates that the gradients point in the same direction, a value of 0 indicates that they are orthogonal, and a value of -1 indicates that they point in opposite directions.
During training, the cosine similarity of gradients is computed periodically to measure the similarity between the primary and auxiliary tasks. If the cosine similarity falls below a certain threshold, the auxiliary task is not providing useful information and may be interfering with the optimization of the primary task. In this case, the adaptive weight assigned to the auxiliary task is reduced or the auxiliary task is removed altogether. Notably, the approach treats task similarity as dependent on the parametrization of the model and on the current values of its parameters, so the similarity index serves as a heuristic that is re-evaluated, and the adaptive weight adjusted, as the model is updated (Du et al., 2021).
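The thresholding rule described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: gradients are assumed to be flattened into lists of floats, and the `threshold` and `weighted` parameters, as well as the function names, are hypothetical choices made for this sketch.

```python
import math


def cosine_similarity(g_main, g_aux):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_main, g_aux))
    norm_main = math.sqrt(sum(a * a for a in g_main))
    norm_aux = math.sqrt(sum(b * b for b in g_aux))
    if norm_main == 0.0 or norm_aux == 0.0:
        return 0.0  # assumption: a zero gradient carries no directional signal
    return dot / (norm_main * norm_aux)


def combined_gradient(g_main, g_aux, threshold=0.0, weighted=False):
    """Combine gradients, keeping the auxiliary one only when it is aligned.

    threshold: similarity below which the auxiliary gradient is dropped.
    weighted:  if True, scale the auxiliary gradient by the similarity
               instead of adding it with weight 1.
    """
    sim = cosine_similarity(g_main, g_aux)
    if sim < threshold:
        aux_weight = 0.0            # block the interfering auxiliary gradient
    elif weighted:
        aux_weight = max(sim, 0.0)  # adaptive weight from the alignment
    else:
        aux_weight = 1.0
    return [a + aux_weight * b for a, b in zip(g_main, g_aux)]


def sgd_step(params, g_main, g_aux, lr=0.1):
    """One gradient-descent update using the combined gradient."""
    update = combined_gradient(g_main, g_aux)
    return [p - lr * u for p, u in zip(params, update)]


# Usage: the auxiliary gradient opposes the primary one, so it is ignored.
new_params = sgd_step([1.0, 1.0], [1.0, 0.0], [-1.0, 0.5])
```

With `threshold=0.0`, the auxiliary gradient is used only when it does not point against the primary gradient, which matches the descent-direction condition discussed above.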
In the same vein, He et al. (2022) proposed MetaBalance, an approach that balances the auxiliary losses by manipulating the gradients of the auxiliary tasks with respect to the shared parameters of a multi-task network. The approach adjusts the magnitude of each auxiliary gradient so that the auxiliary task is neither strong enough to harm the primary task nor too weak to make a significant contribution to it. The approach is highly flexible, as it can be adapted to different scenarios. In their evaluation, He et al. (2022) reported an improvement of 8.34 percent in terms of NDCG@10 on two real-world datasets.
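The idea of keeping an auxiliary gradient from being too strong or too weak can be sketched as a simple rescaling of its norm toward the norm of the primary gradient. This is a simplified illustration rather than the MetaBalance algorithm itself: it works on a single pair of flattened gradients, and the `relax` parameter and function names are assumptions made for this sketch.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(g * g for g in grad))


def balance_auxiliary_gradient(g_main, g_aux, relax=0.7):
    """Rescale g_aux so that its magnitude moves toward that of g_main.

    relax is a hypothetical knob in [0, 1]:
      1.0 gives the auxiliary gradient the same norm as the primary one,
      0.0 leaves the auxiliary gradient unchanged.
    """
    norm_main = l2_norm(g_main)
    norm_aux = l2_norm(g_aux)
    if norm_aux == 0.0:
        return list(g_aux)          # nothing to rescale
    scale = norm_main / norm_aux    # < 1 shrinks a dominant task, > 1 boosts a weak one
    return [relax * scale * g + (1.0 - relax) * g for g in g_aux]


# Usage: an auxiliary gradient ten times larger than the primary one is scaled down.
balanced = balance_auxiliary_gradient([3.0, 4.0], [30.0, 40.0], relax=1.0)
```

Only the magnitude changes in this sketch; the direction of the auxiliary gradient is preserved, which distinguishes magnitude balancing from the direction-based filtering used with cosine similarity.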
Auxiliary Module in dMTL
In their study, Liu et al. (2019) seek to deal with the competing gradient problem using auxiliary modules. To this end, the authors explicitly designed auxiliary modules that are jointly optimized with the shared layers and supervised by auxiliary task objectives. Liu et al. (2019) observe that the shared layers during training served as a regularizer by introducing hierarchical inductive bias to eliminate the competing gradient problem that occurs when the gradient of the auxiliary task interferes with the optimization of the primary task. The introduction of the auxiliary modules improved the optimization of each task, given that the modules provided additional signals that facilitated the learning of task-specific features through the supervision of auxiliary tasks. The approach can be represented mathematically as follows. Let x be the input to the network, y_p the target output of the primary task, and y_i the target output of the i-th auxiliary task. Let Θ be the parameters of the shared layers, and let Θ_p and Θ_i be the task-specific parameters. The loss function for the joint optimization of the primary and auxiliary tasks is given by:
L(Θ, Θ_p, Θ_i) = L_p(y_p, f(x; Θ, Θ_p)) + Σ_i w_i L_i(y_i, g(x; Θ, Θ_i))
where f(x; Θ, Θ_p) and g(x; Θ, Θ_i) are the task-specific sub-networks that share the same layers, L_p and L_i are the loss functions for the primary and i-th auxiliary task, respectively, and w_i is the weight given to the i-th auxiliary task.
During training, the network learns the shared layers and task-specific parameters jointly by minimizing the loss function L(Θ, Θ_p, Θ_i) using gradient descent. The auxiliary tasks supply additional supervision signals that facilitate the learning of task-specific features, while the shared layers carry the regularizing inductive bias that mitigates the competing gradient problem. During testing, the auxiliary modules are discarded, so they add no computational cost to the MTL framework at inference time and the framework can be deployed in real-world scenarios without additional overhead (Liu et al., 2019).
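The joint objective and the removal of auxiliary modules at test time can be sketched as follows. This is a toy illustration under stated assumptions, not the architecture of the cited work: a single shared dense layer with ReLU stands in for the shared layers, each task has one linear head, mean squared error stands in for L_p and L_i, y_p and y_i are treated as targets, and all function and dictionary names are hypothetical.

```python
def linear(x, weights, bias):
    """Dense layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def mse(pred, target):
    """Mean squared error, used here as a stand-in for any task loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def shared_features(x, shared):
    """Shared layers (parameters Theta) used by every task."""
    return relu(linear(x, shared["W"], shared["b"]))


def joint_loss(x, y_p, aux_targets, shared, primary_head, aux_heads, aux_weights):
    """L = L_p + sum_i w_i * L_i, computed on top of the shared features."""
    h = shared_features(x, shared)
    loss = mse(linear(h, primary_head["W"], primary_head["b"]), y_p)
    for head, y_i, w_i in zip(aux_heads, aux_targets, aux_weights):
        loss += w_i * mse(linear(h, head["W"], head["b"]), y_i)
    return loss


def predict(x, shared, primary_head):
    """Inference path: shared layers plus primary head, auxiliary heads dropped."""
    return linear(shared_features(x, shared), primary_head["W"], primary_head["b"])


# Usage with hand-picked toy parameters.
shared = {"W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.0, 0.0]}
primary_head = {"W": [[1.0, 1.0]], "b": [0.0]}
aux_heads = [{"W": [[1.0, -1.0]], "b": [0.0]}]
loss = joint_loss([1.0, 2.0], [2.0], [[0.0]], shared, primary_head, aux_heads, [0.5])
y_hat = predict([1.0, 2.0], shared, primary_head)
```

Note that `predict` never touches `aux_heads`, which is why the auxiliary branches add no cost once training is finished.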
Gated Multi-task Network
Xiao, Zhang, and Chen's (2018) study proposed a novel approach for reducing interference between auxiliary and main tasks in Convolutional Neural Networks. They introduced the use of gate mechanisms to filter feature flows between the two tasks. The gate mechanism selectively allows or blocks the flow of features from the auxiliary task to the main task based on the relevance of the feature to the main task. This approach is highly effective as it learns the selection rules automatically, thus minimizing interference between the two tasks. The authors tested their approach on three different datasets, and the results showed a significant improvement in performance compared to the baselines. Specifically, the proposed approach achieved higher accuracy on the main task while maintaining similar or improved performance on the auxiliary tasks. The authors also conducted additional experiments to verify the effectiveness of their approach, including analyzing the impact of different gate mechanisms and varyi...