
Bridging Domains with Transfer Learning: Strategies for Multi-Task and Multi-Dataset Finetuning

Written By

Tehseen Ullah, Steven Davy and John Kelleher

Submitted: 16 June 2025 Reviewed: 01 July 2025 Published: 03 August 2025

DOI: 10.5772/intechopen.1011820


From the Edited Volume

Transfer Learning - Unlocking the Power of Pretrained Models [Working Title]

Dr. Pier Luigi Mazzeo and Associate Prof. Alessandro Bruno


Abstract

The chapter explores the integration of transfer learning with multi-dataset and multi-task fine-tuning to build versatile, resource-efficient models for computer vision tasks. It delves into the paradigm shift from traditional single-task training, where models are optimized for isolated datasets, to a unified multi-task framework in which a shared backbone supports diverse, domain-specific tasks. By leveraging pretrained models and applying multi-dataset and multi-task fine-tuning techniques such as mixture of experts, model merging, adaptive multi-task learning, and modular architectures, the discussed approaches enhance performance across varied datasets while significantly reducing computational cost and training time. This chapter addresses challenges such as data imbalance, conflicting gradients, and sensitivity to merging coefficients, outlining how modern transfer learning techniques can reconcile these issues to produce scalable models that are both efficient and adaptable to real-world scenarios.

Keywords

  • finetuning
  • transfer learning
  • image classification
  • computer vision
  • multi-dataset finetuning
  • multi-task learning
  • resource efficient finetuning
  • fine-tuning techniques
  • mixture of experts

1. Introduction

The field of artificial intelligence (AI) has witnessed unprecedented growth over the past decade, driven largely by advances in deep learning. From diagnosing medical conditions to enabling autonomous vehicles, machine learning models have become indispensable tools for solving complex real-world problems. Yet, as these applications grow more sophisticated, a critical challenge emerges: how can we develop models that adapt efficiently to diverse tasks and datasets without requiring exorbitant computational resources or massive amounts of labeled data? This question lies at the heart of modern AI research and serves as the foundation for this chapter’s exploration of transfer learning, a paradigm that re-imagines how models acquire, retain, and reuse knowledge across domains.

1.1 The evolution of model training: From isolation to integration

Traditional approaches to training machine learning models, particularly in computer vision, have long relied on a single-task and single-dataset framework. In this paradigm, a model is meticulously trained from scratch for a specific objective, such as classifying images in the CIFAR-100 [1] dataset or detecting objects in COCO [2]. While this method has yielded remarkable results, it suffers from significant limitations. For example, training models in isolation demands vast amounts of task-specific labeled data, a luxury unavailable in many practical scenarios, such as medical imaging or rare defect detection in manufacturing. Furthermore, these models often lack versatility; a network trained exclusively for facial recognition cannot pivot to segmenting tumors in MRI scans without starting over and training from scratch for that specific application. This rigidity is compounded by the computational costs of training separate models for each task, which strain resources and hinder scalability [3].

To counter these limitations, researchers and engineers turned to transfer learning, a technique that has reshaped the AI landscape. At its core, transfer learning allows a model trained on one task or dataset to repurpose its knowledge for a related but distinct task. For instance, a neural network pretrained on millions of generic images can be “fine-tuned” to recognize specific types of plants with only a small botanical dataset [4]. This approach mimics how humans learn, applying foundational knowledge from one domain to accelerate mastery in another. Early applications of transfer learning focused on single-task adaptation, but the research community soon recognized its potential for broader integration: what if a model could simultaneously tackle multiple tasks or generalize across multiple datasets, sharing insights between them? This question catalyzed a paradigm shift toward multi-task and multi-dataset fine-tuning, which is the focus of this chapter.

1.2 The promise of multi-task and multi-dataset learning

Imagine a single model that can analyze satellite imagery to predict weather patterns, detect deforestation, and monitor urban sprawl, all while drawing from datasets as varied as aerial photographs, climate logs, and social media feeds. Such a model would save computational resources while uncovering latent connections between seemingly unrelated tasks. This vision underpins the strategies discussed in this chapter. By unifying tasks and datasets under a shared framework, we can create models that are both resource-efficient and adaptable [5].

The key to this unification is the concept of a shared backbone, a common neural network architecture that processes input data and extracts features usable across multiple tasks. For example, the early layers of a convolutional neural network (CNN) might learn to detect edges and textures relevant to both object detection and semantic segmentation [6], while task-specific “heads” branching off from the backbone learn to generate specialized outputs. This design drastically reduces redundancy: instead of training separate models for each task, the backbone’s generalized features serve as a shared foundation, minimizing computational overhead. However, integrating multiple tasks and datasets is far from straightforward. Challenges such as data imbalance, conflicting gradients, and sensitivity to hyperparameters threaten to destabilize the training process. These obstacles demand innovative solutions such as mixture of experts (MoE), model merging, adaptive multi-task learning, and other modular architectures, all of which are discussed in more detail in later sections.
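To make the shared-backbone idea concrete, the minimal PyTorch sketch below pairs a pretrained ResNet-50 feature extractor with a classification head and a coarse segmentation head. The layer sizes, class counts, and task names are illustrative placeholders rather than a prescribed architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskModel(nn.Module):
    """Shared pretrained backbone with task-specific heads (illustrative sketch)."""
    def __init__(self, num_classes: int = 10, num_seg_classes: int = 21):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average pooling and fully connected layers to keep spatial features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Task-specific heads branching off the shared features.
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(2048, num_classes)
        )
        self.seg_head = nn.Sequential(        # coarse per-pixel logits; upsample downstream
            nn.Conv2d(2048, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_seg_classes, kernel_size=1),
        )

    def forward(self, x):
        feats = self.backbone(x)              # shared features reused by every task
        return {"cls": self.cls_head(feats), "seg": self.seg_head(feats)}

model = MultiTaskModel()
out = model(torch.randn(2, 3, 224, 224))
print(out["cls"].shape, out["seg"].shape)     # [2, 10] and [2, 21, 7, 7]
```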

1.3 Bridging domains in real-world applications

The implications of these approaches extend far beyond academic benchmarks. Consider health care, where a model trained on diverse datasets such as X-rays from one hospital and pathology slides from another could improve diagnostic accuracy while preserving patient privacy through federated learning [7]. Similarly, in autonomous driving, a vehicle’s perception system must simultaneously detect pedestrians, read traffic signs, and predict lane boundaries, often under varying weather conditions. A multi-task model trained on datasets spanning sunny, rainy, and snowy environments would inherently generalize better than a collection of single-task models [8].

Moreover, multi-dataset training mitigates the “domain gap” problem, where a model’s performance plummets when applied to data that deviates from its training distribution. By exposing the model to varied datasets during fine-tuning, it learns robust features that transcend individual domains [9]. This is particularly vital in global applications, where data may reflect cultural, geographic, or demographic diversity.

1.4 Navigating challenges: Toward efficient and scalable AI

While the benefits of multi-task learning and multi-dataset fine-tuning are compelling, they raise ethical and practical considerations. For example, combining datasets risks amplifying biases if not carefully audited. A model trained on facial recognition data from predominantly one demographic group may perpetuate inequities when deployed in multi-ethnic regions [10]. Similarly, the computational savings of shared backbones must be weighed against the energy costs of training large models, a tension underscoring the need for efficiency-focused architectures [11].

This chapter does not merely catalog these challenges but provides a roadmap for addressing them. By reconciling conflicting gradients through adaptive loss weighting or designing modular systems that compartmentalize tasks, modern transfer learning techniques pave the way for scalable and sustainable AI systems.

2. Foundations of transfer learning in computer vision

Transfer learning (TL) has revolutionized computer vision by enabling models to repurpose knowledge from one task or dataset to solve new, related problems. This section builds the conceptual bedrock for the rest of the chapter, clarifying how TL evolved from its single-task origins to today’s multi-task and multi-dataset paradigms. By anchoring technical strategies in their historical and theoretical context, we equip experts and non-experts alike to grasp the innovations discussed in later sections.

At its core, TL relies on the idea that models trained on large, diverse datasets learn reusable features, such as textures and object parts, that generalize across domains. For example, the convolutional layers of a network pretrained on natural images can extract meaningful patterns from medical scans, reducing the need for extensive labeled healthcare data. Figure 1 illustrates this process, where the pretrained convolutional layers are reused as a frozen feature extractor, while the task-specific dense layers are replaced with a new classifier tailored to the target domain.

Figure 1.

Illustration of the transfer learning process from a source to a target domain by reusing pretrained convolutional layers. The dense layers are replaced with a new classifier tailored to the target labels, while the feature extractor remains frozen.
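The workflow in Figure 1 maps onto a few lines of PyTorch, as in the minimal sketch below: a pretrained ResNet-18 is frozen and its classifier is replaced by a new head for the target labels. The backbone choice and the five target classes are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Reuse the pretrained convolutional layers as a frozen feature extractor.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                    # freeze the backbone

# Replace the dense classifier with a new head tailored to the target labels
# (5 classes is an arbitrary placeholder); the new layer is trainable by default.
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new classifier's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```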

While early TL focused on adapting models to single tasks (e.g., fine-tuning ResNet for flower classification), real-world applications demand systems that handle multiple tasks (e.g., autonomous vehicles detecting pedestrians, traffic signs, and lanes simultaneously) and diverse datasets (e.g., merging satellite imagery and street-view photos). This shift necessitated architectures and training strategies that balance efficiency, scalability, and performance.

2.1 From single-task adaptation to multi-domain generalization

The limitations of single-task TL became apparent as AI applications grew more complex. Training separate models for each task is computationally expensive and wasteful of resources. Moreover, features learned in isolation often fail to generalize across domains. The following breakthroughs addressed these issues:

2.1.1 Multi-task learning (MTL)

Multi-task learning (MTL) trains a single model to perform multiple tasks by sharing the same backbone network while using task-specific output heads. For instance, a shared CNN backbone can extract features for both object detection (bounding boxes) and semantic segmentation (pixel masks). Key innovations include:

  • Adaptive loss balancing: Dynamically weighting task losses to prevent one task from dominating the training [12] (a minimal sketch follows this list).

  • Gradient conflict mitigation: Techniques like PCGrad [13] adjust gradients during backpropagation to avoid interference between tasks.
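As referenced above, the sketch below shows one common form of adaptive loss balancing, an uncertainty-style weighting in the spirit of [12]: each task loss is scaled by a learnable factor so that no single task dominates training. The two dummy task losses are placeholders.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learnable per-task weights (log-variances) for balancing task losses."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # one weight per task

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]  # weighted loss + regularizer
        return total

weighting = UncertaintyWeighting(num_tasks=2)
loss_cls, loss_seg = torch.tensor(0.7), torch.tensor(2.3)  # dummy per-task losses
total_loss = weighting([loss_cls, loss_seg])
```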

2.1.2 Multi-dataset training

Models trained on diverse datasets (e.g., combining COCO, ADE20K, and DomainNet) learn robust, domain-invariant features. For example, a model exposed to sketches, paintings, and photos can better recognize objects in abstract art [14]. However, merging datasets introduces the following challenges:

  • Label incompatibility: Resolving mismatched annotation formats (e.g., bounding boxes vs. segmentation masks).

  • Domain gaps: Aligning feature distributions across datasets with differing styles (e.g., medical scans vs. natural images).

Recent methods like dataset condensation and model soups [15, 16] address these issues by synthesizing representative data subsets or averaging weights of models fine-tuned on individual datasets.
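As a concrete illustration of the weight-averaging idea behind model soups [16], the sketch below builds a uniform soup from models fine-tuned from the same pretrained checkpoint. It assumes the models share an identical architecture and is not the authors' implementation.

```python
import copy
import torch

def uniform_soup(finetuned_models):
    """Average the weights of architecture-identical fine-tuned models."""
    soup = copy.deepcopy(finetuned_models[0])
    avg_state = {k: v.clone().float() for k, v in soup.state_dict().items()}
    for model in finetuned_models[1:]:
        for k, v in model.state_dict().items():
            avg_state[k] += v.float()
    for k in avg_state:
        avg_state[k] /= len(finetuned_models)
    # Integer buffers (e.g., BatchNorm counters) are cast back to their dtype on load.
    soup.load_state_dict(avg_state)
    return soup
```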

2.2 The rise of generalist models

The pursuit of adaptable, multi-purpose systems has driven the emergence of generalist models capable of handling diverse tasks and modalities with minimal task-specific tuning. Central to this shift are advances in self-supervised learning (SSL) and vision-language models (VLMs), which learn rich, transferable representations without relying on labeled data. For instance, masked autoencoders like MAE [17] reconstruct images from partial inputs to capture spatial hierarchies, while VLMs such as CLIP [18] align text and image embeddings for zero-shot inference, such as classifying medical scans using descriptive prompts. Complementing these innovations, parameter-efficient fine-tuning (PEFT) techniques like LoRA and adapters [19, 20] enable lightweight updates to massive pretrained models, making them viable for edge deployment. Together, these approaches reduce reliance on task-specific data and expensive computation, paving the way for models that fluidly adapt to multi-domain challenges, ranging from analyzing satellite imagery with text-guided prompts to diagnosing diseases across varied medical datasets, while maintaining efficiency and scalability.

3. Strategies for multi-task and multi-domain fine-tuning

The shift from isolated, single-task models to unified systems capable of learning from multiple datasets and tasks represents a pivotal advancement in the modern deep learning paradigm. However, this integration introduces unique challenges: How can models reconcile conflicting objectives across tasks? How can heterogeneous datasets, with mismatched labels, domains, or scales, be harmonized without catastrophic interference? In the following sections, we explore cutting-edge research and methodologies that address these challenges, balancing computational efficiency with robust performance. Central to these approaches is the principle of shared representation learning, where a single deep learning model extracts features relevant to diverse tasks (e.g., object detection and segmentation) and datasets (e.g., natural images and medical scans). Techniques like parameter-efficient fine-tuning (PEFT) minimize training costs by updating only small subsets of pretrained weights, while adaptive multi-task learning (MTL) dynamically balances task losses to prevent dominance by data-rich objectives. Meanwhile, modular approaches like mixture of experts (MoE) and model merging enable multi-task learning without sacrificing scalability, routing inputs to domain-specific subnetworks or combining pretrained models into cohesive systems. Figure 2 illustrates a unified view of prominent PEFT approaches, including adapter tuning, prompt tuning, prefix tuning, and side tuning, each offering different trade-offs between trainable parameters, architectural modification, and performance efficiency. These approaches collectively aim to overcome data imbalance, domain shifts, and gradient conflicts. The following subsections dissect these methodologies, highlighting their theoretical underpinnings, implementation trade-offs, and applications across different industries.

Figure 2.

Detailed architecture of various PEFT methods.

3.1 Parameter-efficient fine-tuning (PEFT)

Parameter-efficient fine-tuning (PEFT) has emerged as a pivotal method for adapting large-scale pretrained vision models (PVMs) to downstream tasks without the computational and storage burdens of full fine-tuning. Unlike conventional approaches that update all of the model’s parameters, PEFT updates only a small subset of weights or introduces lightweight modules, preserving the pretrained knowledge while achieving competitive performance. This section synthesizes the latest advancements in PEFT, focusing on methodologies, applications, and challenges, with insights drawn from foundational and cutting-edge research.

3.1.1 Methodologies and taxonomies

PEFT methods are broadly categorized into three paradigms: addition-based, partial-based, and unified-based fine-tuning [21]. Addition-based methods insert trainable modules into the PVM architecture, such as adapters or prompts, while keeping the original weights frozen. For instance, adapter-based modules employ bottleneck structures with down-projection and up-projection layers to transform intermediate features. These modules, integrated into transformer layers, enable task-specific adaptation with minimal parameter overhead, typically 0.05–10% of the full model size. Vision prompt tuning (VPT) [22] extends this idea by prepending learnable prompts to input embeddings, aligning downstream data distributions with pretraining objectives.
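A minimal sketch of such an addition-based adapter is shown below: a bottleneck with down- and up-projections added residually inside a frozen transformer layer. The hidden sizes are illustrative, and only the adapter's parameters would be trained.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # up-projection
        nn.init.zeros_(self.up.weight)           # start as a near-identity module
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual adaptation

tokens = torch.randn(2, 197, 768)                # e.g., ViT token embeddings
print(Adapter()(tokens).shape)                   # torch.Size([2, 197, 768])
```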

On the other hand, partial-based methods update only specific subsets of existing parameters. For example, BitFit [23] fine-tunes bias terms exclusively, achieving over 95% of full fine-tuning performance on benchmarks, while LoRA [19] injects low-rank matrices into attention layers to approximate weight updates. These methods excel in efficiency, often training less than 1% of total parameters. Unified-based approaches combine multiple strategies into cohesive frameworks. For example, the VMT-Adapter [24], a state-of-the-art method, unifies shared projections for cross-task interaction and task-specific modules for dense scene understanding. By decomposing parameters via Kronecker products, VMT-Adapter-Lite further reduces trainable weights to 0.36% of the original model while maintaining competitive performance.
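To illustrate the low-rank reparameterization used by LoRA [19], the sketch below wraps a frozen linear layer with a trainable low-rank correction. The rank, scaling, and initialization follow common practice, but the exact values here are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + scale * B (A x)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 6144 trainable parameters vs. roughly 590k frozen ones in the base layer
```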

3.1.2 Applications and performance

PEFT has been widely applied to tasks ranging from image classification to dense scene understanding. As shown in Table 1, VMT-Adapter significantly outperforms full fine-tuning on the PASCAL-Context dataset with only 1% of trainable parameters, achieving a +3.96% performance gain, while its lightweight variant (VMT-Adapter-Lite) also provides strong results with a minimal computational footprint, highlighting the practical advantage of unified PEFT methods in multi-task settings. Similarly, adapters like AdaptFormer [25] demonstrate robust performance on fine-grained visual classification (FGVC) and video action recognition, underscoring their versatility. A critical advantage of PEFT lies in its computational efficiency. Methods like LST [26] decouple gradient backpropagation from the frozen backbone, reducing GPU memory usage by up to 70%. For trillion-parameter models, such efficiencies are indispensable. However, challenges persist, including gradient conflicts in multi-task settings and the interpretability of learned prompts or adapters.

| Method | Params (M) | Perf. Gain | Tasks |
| --- | --- | --- | --- |
| Full Fine-Tuning | 112.62 | 0% (baseline) | Single |
| VMT-Adapter | 1.13 | +3.96% | Multi |
| VMT-Adapter-Lite | 0.40 | +1.34% | Multi |
| Shared Adapter | 4.71 | −0.64% | Multi |

Table 1.

Performance comparison of parameter-efficient fine-tuning methods on dense scene understanding tasks.

3.1.3 Challenges and future directions

Despite its promise, PEFT faces unresolved issues. First, negative transfer, in which irrelevant pretrained knowledge harms downstream performance, remains a concern, particularly in cross-domain scenarios [27]. Second, scaling PEFT to generative models (e.g., diffusion models) and multimodal systems (e.g., vision-language models) requires novel architectures. Recent work on DiffFit [28] fine-tunes diffusion models via bias and scaling terms, but broader exploration is needed. Finally, the lack of standardized libraries for visual PEFT hinders reproducibility. While NLP benefits from tools like Hugging Face’s PEFT, the vision community lacks equivalent resources, necessitating frameworks that unify diverse methods.

3.2 Multi-task and multi-domain learning

Multi-task learning (MTL) enhances model generalization by training a single model to perform multiple related tasks simultaneously, leveraging shared representations to exploit task commonalities while minimizing interference. This approach reduces computational costs and improves data efficiency compared to training separate task-specific models. Multi-domain learning (MDL), conversely, focuses on adapting models to perform robustly across diverse domains such as datasets, environments, or distributions, addressing challenges like domain shifts and label incompatibility. Both paradigms are critical for scalable transfer learning, enabling models to generalize across tasks and domains with minimal retraining.

In order to achieve this, modern approaches toward multi-task and multi-domain learning integrate mixture of experts (MoE), which dynamically routes inputs to specialized subnetworks for task- or domain-specific processing, and multi-dataset training, which harmonizes heterogeneous data through techniques like domain-invariant feature alignment. Additionally, model merging techniques combine pretrained models into unified architectures, preserving their individual strengths while reducing redundancy. These methods are particularly valuable in vision tasks where labeled data is scarce or fragmented across domains.

Recent advancements in transfer learning increasingly rely on mixture of experts (MoE) architectures to address the dual challenges of multi-task learning (MTL) and multi-domain learning (MDL). By dynamically routing inputs to specialized subnetworks, MoE frameworks enable models to balance shared knowledge with task- or domain-specific adaptations, minimizing interference and enhancing generalization. Two pioneering implementations, MoE-FFD for face forgery detection [29] and AdaMV-MoE for multi-task vision recognition [30], exemplify how MoE principles advance scalable and efficient learning across tasks and domains.
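Before turning to these two systems, the sketch below illustrates the core MoE mechanic they build on: a gating network scores a set of expert subnetworks for each input and only the top-k experts are evaluated and combined. The expert sizes, number of experts, and k are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sparsely gated mixture of experts: route each input to its top-k experts."""
    def __init__(self, dim: int = 256, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                          # x: [batch, dim]
        scores = self.gate(x)                      # [batch, num_experts]
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)      # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e      # inputs routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(8, 256)).shape)      # torch.Size([8, 256])
```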

3.2.1 Adaptive feature extraction with MoE-FFD

MoE-FFD tackles the challenge of detecting deepfakes across diverse datasets, such as CelebDF-v2 and DFDC-P, by integrating the global context of vision transformers (ViTs) with the local sensitivity of CNNs. The architecture, as shown in Figure 3, preserves the pretrained ViT backbone while injecting domain-specific adaptations through two modular components:

  • Low-rank adaptation (LoRA): Lightweight low-rank adaptation (LoRA) layers adjust the ViT’s attention mechanisms to capture subtle forgery clues, such as identity inconsistencies or unnatural textures, without overwriting the foundational ImageNet knowledge.

  • Convolutional adapters: Five specialized convolutional experts, including angular difference and central difference convolutions, localize artifacts like blending boundaries or noise patterns. Moreover, a gating network dynamically selects the optimal combination of experts for each input, ensuring domain-specific processing.

Figure 3.

Overview of the designed MoE-FFD framework. (a) Overall model architecture; (b) MoE-FFD transformer block; (c) Design of the MoE Adapter layer; (d) Details of each Adapter expert; (e) Details of the designed MoE LoRA layer; (f) Details of each LoRA expert.

Their results show that MoE-FFD achieves an 86.78% average AUC on unseen datasets with only 15.51 M activated parameters, outperforming full ViT fine-tuning by a significant margin. Notably, it maintains robustness under perturbations like Gaussian blur, retaining 72.3% AUC at high severity levels compared to ViT-B’s 54.1%. This efficiency stems from sparse activation where only relevant experts are engaged per input, making it ideal for deployment in resource-constrained environments.

3.2.2 Dynamic task specialization with AdaMV-MoE

AdaMV-MoE addresses the complexity of multi-task vision recognition through adaptive expert allocation. As illustrated in Figure 4, unlike conventional MTL models that rigidly share a backbone, AdaMV-MoE employs task-dependent routers and an adaptive expert selection (AES) mechanism. The AES monitors validation loss to dynamically adjust the number of experts per task, expanding capacity for complex tasks like detection while contracting it for simpler ones like classification. For example, detection tasks activate 50.94 M parameters compared to 42.95 M for classification, reflecting their higher computational demands. This approach mitigates gradient conflicts inherent in MTL by isolating task-specific computations. On the UViT-Base backbone [31], AdaMV-MoE achieves 79.65% classification accuracy and 44.14% detection AP, surpassing shared-backbone ViT models by 6.66% and 1.13%, respectively.

Figure 4.

High-level overview of AdaMV-MoE, consisting of ViT (left) and SMoE (middle) layers, where the SMoE layer is built by replacing the original multi-layer perceptron (MLP) with a sparsely activated mixture of experts (MLPs). It enables multi-task learning by leveraging task-specific router networks (right). Each router network determines how many (adaptive) and which (specialized) experts to activate for the given task.
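The adaptive expert selection idea can be pictured with the heavily simplified, hypothetical sketch below: a task's expert budget grows when its validation loss stops improving and is otherwise left unchanged. The thresholds, step size, and bookkeeping are illustrative assumptions, not the paper's exact procedure.

```python
def adjust_num_experts(num_experts, prev_val_loss, curr_val_loss,
                       min_experts=1, max_experts=8, tol=1e-3):
    """Grow a task's expert budget when its validation loss plateaus (illustrative rule)."""
    if curr_val_loss > prev_val_loss - tol:
        return min(num_experts + 1, max_experts)   # plateau: allocate more capacity
    return max(num_experts, min_experts)           # still improving: keep the budget

experts_per_task = {"classification": 2, "detection": 2}   # hypothetical tasks
experts_per_task["detection"] = adjust_num_experts(2, prev_val_loss=1.20, curr_val_loss=1.21)
print(experts_per_task)   # {'classification': 2, 'detection': 3}
```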

3.2.3 Model merging for multi-task learning

In scenarios where collecting joint training data for all tasks or domains is impractical due to privacy constraints or computational costs, model merging presents an efficient alternative for enabling multi-task learning (MTL). Instead of retraining a single model from scratch or maintaining multiple task-specific models, recent approaches merge independently fine-tuned models into one unified architecture capable of performing multiple tasks. A particularly promising example of such an approach is AdaMerging, a data-free and unsupervised method for adaptive model merging [32]. The core idea is to autonomously learn how to combine multiple task-specific models, each fine-tuned from a shared pretrained backbone, into a single multi-task model, without requiring access to the original training data.

AdaMerging builds upon the concept of task arithmetic, where each task is represented by a task vector, defined as the difference between the task-specific model parameters and the base pretrained model parameters. By summing these task vectors and merging them into the pretrained model, a new model can theoretically support multiple tasks. However, previous methods such as task arithmetic and TIES-Merging suffered from performance drops due to their reliance on fixed or manually-tuned merging coefficients, which often fail to resolve conflicts between tasks [33, 34]. To address this, AdaMerging introduces two key innovations:

  • Task-wise AdaMerging: Assigns a unique merging coefficient to each task vector, allowing the model to weigh tasks differently based on their contribution to overall performance.

  • Layer-wise AdaMerging: Further refines this approach by assigning distinct coefficients to each layer of each task vector, accounting for the observation that shallow layers typically encode generalizable features, while deeper layers capture task-specific nuances.

The innovation lies in how these coefficients are optimized. Rather than relying on grid search or labeled data, AdaMerging uses entropy minimization on unlabeled test samples as a surrogate objective. This is grounded in the finding that prediction entropy correlates strongly with model loss; lower entropy often indicates more confident and accurate predictions. By minimizing entropy during the merging process, AdaMerging effectively enhances model performance across tasks, even in the absence of supervision.
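The sketch below captures this recipe in simplified, task-wise form under stated assumptions: the pretrained state dict and per-task task vectors already exist, the model is called functionally so gradients flow back to the merging coefficients, and names such as `pretrained_state`, `task_vectors`, and `unlabeled_batch` are placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def merged_params(pretrained_state, task_vectors, lambdas):
    # theta = theta_0 + sum_k lambda_k * tau_k  (task-wise merging coefficients)
    return {name: w0 + sum(lam * tv[name] for lam, tv in zip(lambdas, task_vectors))
            for name, w0 in pretrained_state.items()}

def entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=-1).mean()

def adamerging_step(model, pretrained_state, task_vectors, lambdas, optimizer, unlabeled_batch):
    params = merged_params(pretrained_state, task_vectors, lambdas)
    logits = functional_call(model, params, (unlabeled_batch,))  # keeps gradients w.r.t. lambdas
    loss = entropy(logits)                                       # entropy on unlabeled test data
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# lambdas = torch.full((num_tasks,), 0.3, requires_grad=True)
# optimizer = torch.optim.Adam([lambdas], lr=1e-3)
```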

The experiments in the AdaMerging paper, conducted on diverse image classification tasks such as SUN397, SVHN, DTD, and EuroSAT using ViT-B/32 and ViT-L/14 models, demonstrate that layer-wise AdaMerging++ outperforms existing methods by up to 11% in average accuracy over task arithmetic baselines. Moreover, it shows:

  • Superior generalization to unseen tasks, with improvements of up to 9.1%.

  • Robustness under distribution shifts, outperforming competitors under corrupted data scenarios such as Gaussian noise and JPEG compression by up to 11.2%.

Importantly, this method is not only more accurate but also more scalable: traditional techniques require manually tuning one coefficient per task or layer, which becomes infeasible at scale, whereas AdaMerging eliminates this bottleneck through automated optimization using gradient descent and backpropagation on entropy. In the context of scalable transfer learning, AdaMerging aligns well with trends in model modularity and reusability. It preserves the benefits of individual models while unifying them into a single robust architecture. This has particular relevance for vision tasks with limited labeled data, where pretrained models are often fine-tuned independently for each domain or task.

In summary, AdaMerging represents a significant step forward in model merging for MTL, offering a practical, unsupervised, and data-efficient pathway to fuse multiple specialized models into one cohesive system. As such, it complements emerging paradigms like mixture of experts by providing a post-hoc merging strategy when dynamic routing or joint training is not feasible.

3.2.4 Scalable multi-task learning with disjoint datasets

Recent research in multi-task learning (MTL) increasingly explores scenarios where datasets are disjoint, meaning each task is associated with its own exclusive dataset, without overlapping samples or shared labels. This reflects many real-world applications, such as facial analysis or autonomous driving, where datasets for distinct tasks such as expression recognition and pose estimation come from different distributions or label spaces. Two key papers address this challenge using distinct and complementary strategies: MTL-SA [35] and heterogeneous task learning [36].

The paper “Beyond Without Forgetting: Multi-task Learning for Classification with Disjoint Datasets” introduces MTL-SA, a framework that builds upon the classic “Learning without Forgetting” (LwF) [37] paradigm, adapting it for disjoint datasets in classification tasks. In this setup, each task has its own labeled dataset, and conventional joint training falls short because data from the other tasks, which is unlabeled with respect to the task at hand, is not directly utilized to improve learning. MTL-SA introduces a two-fold strategy to bridge this gap:

  1. Augment each task using pseudo-labeled data from the other task’s dataset.

  2. Selectively weight and integrate those samples based on pseudo-label confidence and domain similarity.

The method introduces label vector interpolation (sketched in code after the following list), where each unlabeled sample’s training label is a weighted combination of:

  • A soft label: output from the previous model epoch, preserving learned knowledge.

  • A pseudo label: converted from the soft label by choosing the most confident class, adding new knowledge.
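As referenced above, a minimal sketch of the interpolation step is given below. The fixed mixing weight is a placeholder: MTL-SA additionally weights samples by pseudo-label confidence and domain similarity.

```python
import torch
import torch.nn.functional as F

def interpolate_label(soft_label: torch.Tensor, mix: float = 0.5) -> torch.Tensor:
    """Blend the previous model's soft prediction with its one-hot pseudo label."""
    pseudo = F.one_hot(soft_label.argmax(dim=-1), num_classes=soft_label.size(-1)).float()
    return mix * soft_label + (1.0 - mix) * pseudo

soft = torch.tensor([[0.2, 0.5, 0.3]])   # output of the previous model epoch
print(interpolate_label(soft))           # tensor([[0.1000, 0.7500, 0.1500]])
```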

Their experimental results show that MTL-SA significantly outperforms both joint and alternating training baselines. For example, MTL-SA improves expression recognition accuracy on the SFEW dataset from 50.11% to 53.50% and age-stage classification accuracy on the PETA dataset from 79.83% to 81.78%.

Moreover, MTL-SA excels in cases where training data must be kept separate across tasks or domains and cannot be merged directly. It demonstrates how data selection and label interpolation can enhance MTL performance even in disjoint settings, making it highly relevant to transfer learning scenarios.

Another innovative method, introduced in the paper “Multitask Learning with Heterogeneous Tasks”, tackles the challenge of jointly training on tasks with different data modalities, label spaces, and image properties. These tasks use disjoint datasets that vary in complexity, resolution, and output types. To handle such heterogeneity, the authors extend the classic hard parameter sharing strategy by introducing task-specific stems before the shared encoder: each task’s input passes through a stem unit that adapts its format (e.g., resolution alignment) before reaching the common feature extractor. The overall structure involves a shared backbone with task-specific stems to preprocess inputs and separate prediction heads per task.
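A minimal sketch of this stem-plus-shared-encoder layout is shown below; the module sizes, tasks, and resolution handling are illustrative placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class StemSharedModel(nn.Module):
    """Task-specific stems -> shared encoder -> task-specific prediction heads."""
    def __init__(self, shared_dim: int = 64):
        super().__init__()
        self.stems = nn.ModuleDict({
            "cls": nn.Sequential(nn.Conv2d(3, shared_dim, 3, padding=1),
                                 nn.AdaptiveAvgPool2d((32, 32))),   # align input resolution
            "seg": nn.Conv2d(3, shared_dim, 3, padding=1),
        })
        self.encoder = nn.Sequential(                               # shared feature extractor
            nn.Conv2d(shared_dim, shared_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(shared_dim, shared_dim, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({
            "cls": nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(shared_dim, 10)),
            "seg": nn.Conv2d(shared_dim, 21, 1),
        })

    def forward(self, x, task: str):
        return self.heads[task](self.encoder(self.stems[task](x)))

model = StemSharedModel()
print(model(torch.randn(2, 3, 96, 96), task="cls").shape)   # torch.Size([2, 10])
```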

The proposed architecture is optimized using a weighted sum of individual task losses, similar to other MTL formulations. Unlike prior work such as UberNet, which operates on a single dataset with multiple tasks, this study evaluates performance across four diverse datasets: CIFAR-10 and STL-10 for classification, MINI-COCO for object detection, and VOC0712 for semantic segmentation. Furthermore, this work underscores the scalability benefits of MTL in resource-constrained settings, even across semantically and structurally diverse tasks. While performance on complex tasks like detection might slightly lag, the overall parameter reduction and performance gains on simpler tasks make this approach practical for real-world applications such as edge computing.

4. Conclusions

Transfer learning represents a significant paradigm shift within the field of artificial intelligence, particularly in computer vision, by fostering more efficient, scalable, and versatile model development. Throughout this chapter, we have explored how modern AI practices have evolved from isolated single-task learning approaches toward unified multi-task and multi-dataset frameworks. By leveraging shared neural network architectures, transfer learning facilitates the reuse of learned representations, thereby significantly reducing computational overhead and training times.

Various advanced strategies for fine-tuning, including parameter-efficient fine-tuning (PEFT), mixture of experts (MoE), adaptive multi-task learning (MTL), and model merging, have been comprehensively reviewed. Techniques like LoRA, adapters, and vision prompt tuning (VPT) demonstrate that substantial improvements in efficiency and effectiveness can be achieved by updating only small subsets of pretrained weights or adding lightweight trainable modules. Moreover, integrating MoE architectures has notably enhanced the capabilities of models in handling multi-task and multi-domain challenges. These architectures dynamically route inputs to specialized subnetworks, effectively mitigating issues such as conflicting gradients, domain discrepancies, and data imbalances. Techniques like MoE-FFD and AdaMV-MoE further underscore the potential of these approaches by demonstrating substantial performance gains in complex, real-world scenarios.

Model merging strategies, such as adaptive merging methods like AdaMerging, illustrate another critical advancement, allowing the fusion of independently fine-tuned models into cohesive multi-task models without requiring additional labeled data. These methods have successfully addressed previous limitations such as sensitivity to merging coefficients and have showcased considerable improvements in scalability and generalization capabilities.

However, while these strategies offer promising avenues for advancing AI efficiency and adaptability, they also bring forth critical ethical and practical considerations. Issues such as potential biases in multi-dataset training and the energy costs associated with large-scale models underscore the importance of ongoing research into ethical and sustainable practices in model training and deployment.

The techniques and insights presented in this chapter broaden the applicability of transfer learning across diverse tasks and domains and significantly advance the pursuit of adaptable, robust, and resource-conscious AI systems. Continued innovation in these areas will undoubtedly shape the future landscape of artificial intelligence, aligning technological advancements with real-world demands and ethical imperatives.

Acknowledgments

This publication has emanated from research supported in part by grants from Research Ireland under Grant numbers SFI/21/FFP-A/9174, SFI/12/RC/2289_P2. For the purpose of open access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Abbreviations

AI: artificial intelligence
MTL: multi-task learning
MDL: multi-domain learning
TL: transfer learning
PEFT: parameter-efficient fine-tuning
MoE: mixture of experts
VPT: vision prompt tuning
LoRA: low-rank adaptation
FGVC: fine-grained visual classification
MLP: multi-layer perceptron
AES: adaptive expert selection
AUC: area under curve
AdaMV-MoE: adaptive multi-task vision mixture-of-experts
FFD: face forgery detection
ViT-B: vision transformer base
ViT-L: vision transformer large
MTL-SA: multi-task learning with selective augmentation
LwF: learning without forgetting
PCGrad: projected conflicting gradient
SSL: self-supervised learning
VLM: vision-language model

References

  1. 1. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Technical Report. Toronto, ON: University of Toronto; 2009
  2. 2. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: Common objects in context. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer Vision – ECCV 2014. Cham: Springer International Publishing; 2014. pp. 740-755
  3. 3. Zhang W, Shen L, Foo CS. Rethinking the role of pre-trained networks in source-free domain adaptation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society; 2023. pp. 18795-18805. Available from: https://api.semanticscholar.org/CorpusID:254686034
  4. 4. Kornblith S, Shlens J, Le QV. Do better ImageNet models transfer better?. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2019. Long Beach, CA: IEEE; 2019. pp. 2656-2666. DOI: 10.1109/CVPR.2019.00277
  5. 5. Zhang Y, Yang Q. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering. 2021. DOI: 10.1109/TKDE.2021.3070203
  6. 6. Zamir A, Sax A, Shen W, Guibas L, Malik J, Savarese S. Taskonomy: Disentangling task transfer learning. arXiv. 2018;1804:08328. DOI: 10.48550/arXiv.1804.08328
  7. 7. Guan H, Yap PT, Bozoki A, Liu M. Federated learning for medical image analysis: A survey. Pattern Recognition. 2023;151:110424
  8. 8. Zhang Y, Carballo A, Yang H, Takeda K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS Journal of Photogrammetry and Remote Sensing. 2023;196:146-177. Available from: https://www.sciencedirect.com/science/article/pii/S0924271622003367
  9. 9. Yeo T, Kar OF, Sax A, Zamir A. Robustness via cross-domain ensembles. arXiv. 2021;2103:1-12. Available from: https://arxiv.org/abs/2103.10919
  10. 10. Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In: Friedler SA, Wilson C, editors. Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Vol. 81 of Proceedings of Machine Learning Research. [Internet]. Brookline, MA: PMLR; 2018. pp. 77-91. Available from: https://proceedings.mlr.press/v81/buolamwini18a.html
  11. 11. Strubell E, Ganesh A, McCallum A. Energy and policy considerations for deep learning in NLP. In: Korhonen A, Traum D, Màrquez L, editors. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019. pp. 3645-3650. Available from: https://aclanthology.org/P19-1355/
  12. 12. Kendall A, Gal Y, Cipolla R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv. 2018;1705:1-13. Available from: https://arxiv.org/abs/1705.07115
  13. 13. Yu T, Kumar S, Gupta A, Levine S, Hausman K, Finn C. Gradient surgery for multi-task learning. arXiv. 2020;2001:1-15. Available from: https://arxiv.org/abs/2001.06782
  14. 14. Wang X, Li H, Fang H, Peng Y, Xie H, Yang X, et al. LineArt: A knowledge-guided training-free high-quality appearance transfer for design drawing with diffusion model. arXiv. 2024;2412:1-14. Available from: https://arxiv.org/abs/2412.11519
  15. 15. Zhao B, Bilen H. Dataset condensation with distribution matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 2023 (WACV). IEEE Winter Conference on Applications of Computer Vision. United States: Institute of Electrical and Electronics Engineers; 2023. pp. 6503-6512. Available from: https://wacv2023.thecvf.com/
  16. 16. Wortsman M, Ilharco G, Gadre SY, Roelofs R, Gontijo-Lopes R, Morcos AS, et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arXiv. 2022;2203:1-17. Available from: https://arxiv.org/abs/2203.05482
  17. 17. He K, Chen X, Xie S, Li Y, Dollar P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society; 2022. pp. 15979-15988
  18. 18. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. arXiv. 2021;2103:1-24. Available from: https://arxiv.org/abs/2103.00020
  19. 19. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-rank adaptation of large language models. arXiv. 2021;2106:1-15. Available from: https://arxiv.org/abs/2106.09685
  20. 20. Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, et al. Parameter-efficient transfer learning for NLP. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. Vol. 97 of Proceedings of Machine Learning Research. Brookline, MA: PMLR; 2019. pp. 2790-2799. Available from: https://proceedings.mlr.press/v97/houlsby19a.html
  21. 21. Xin Y, Luo S, Zhou H, Du J, Liu X, Fan Y, et al. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv. 2024;2401:1-20. Available from: https://api.semanticscholar.org/CorpusID:267412110
  22. 22. Jia M, Tang L, Chen BC, Cardie C, Belongie S, Hariharan B, et al. Visual Prompt Tuning. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIII. Berlin, Heidelberg: Springer-Verlag; 2022. pp. 709-727. DOI: 10.1007/978-3-031-19827-4_41
  23. 23. Ben Zaken E, Goldberg Y, Ravfogel S. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In: Muresan S, Nakov P, Villavicencio A, editors. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. pp. 1-9. Available from: https://aclanthology.org/2022.acl-short.1/
  24. 24. Xin Y, Du J, Wang Q, Lin Z, Yan K. VMT-adapter: Parameter-efficient transfer learning for multi-task dense scene understanding. arXiv. 2023;2312:1-14. Available from: https://arxiv.org/abs/2312.08733
  25. 25. Chen S, Ge C, Tong Z, Wang J, Song Y, Wang J, et al. AdaptFormer: Adapting vision transformers for scalable visual recognition. arXiv. 2022;2205:1-10. Available from: https://arxiv.org/abs/2205.13535
  26. 26. Sung YL, Cho J, Bansal M. LST: Ladder side-tuning for parameter and memory efficient transfer learning. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in Neural Information Processing Systems. Vol. 35. Red Hook, NY: Curran Associates, Inc.; 2022. pp. 12991-13005. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/file/54801e196796134a2b0ae5e8adef502f-Paper-Conference.pdf
  27. 27. Raghu M, Unterthiner T, Kornblith S, Zhang C, Dosovitskiy A. Do vision transformers see like convolutional neural networks? In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW, editors. Advances in Neural Information Processing Systems. Vol. 34. Red Hook, NY: Curran Associates, Inc.; 2021. pp. 12116-12128. Available from: https://proceedings.neurips.cc/paper_files/paper/2021/file/652cf38361a209088302ba2b8b7f51e0-Paper.pdf
  28. 28. Xie E, Yao L, Shi H, Liu Z, Zhou D, Liu Z, et al. DiffFit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. arXiv. 2023;2304:1-15. Available from: https://arxiv.org/abs/2304.06648
  29. 29. Kong C, Luo A, Bao P, Yu Y, Li H, Zheng Z, et al. MoE-FFD: Mixture of experts for generalized and parameter-efficient face forgery detection. arXiv. 2024;2404:1-12. Available from: https://arxiv.org/abs/2404.08452
  30. 30. Chen T, Chen X, Du X, Rashwan A, Yang F, Chen H, et al. AdaMV-MoE: Adaptive multi-task vision mixture-of-experts. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA: IEEE Computer Society; 2023. pp. 17300-17311
  31. 31. Chen W, Du X, Yang F, Beyer L, Zhai X, Lin TY, et al. A simple single-scale vision transformer for object localization and instance segmentation. arXiv. 2022;2112:1-17. Available from: https://arxiv.org/abs/2112.09747
  32. 32. Yang E, Wang Z, Shen L, Liu S, Guo G, Wang X, Tao D. AdaMerging: Adaptive model merging for multi-task learning. In: Proceedings of the Twelfth International Conference on Learning Representations (ICLR). Vol. 2024. [Internet]. OpenReview; 2024. pp. 1-16. Available from: https://openreview.net/forum?id=nZP6NgD3QY
  33. 33. Ilharco G, Ribeiro MT, Wortsman M, Schmidt L, Hajishirzi H, Farhadi A. Editing models with task arithmetic. In: Proceedings of the Eleventh International Conference on Learning Representations (ICLR). arXiv preprint arXiv. 2023;2212:1-18. Available from: https://openreview.net/forum?id=6t0Kwf8-jrj
  34. 34. Yadav P, Tam D, Choshen L, Raffel C, Bansal M. TIES-Merging: Resolving interference when merging models. arXiv. 2023;2306:1-12. Available from: https://arxiv.org/abs/2306.01708
  35. 35. Hong Y, Niu L, Zhang J, Zhang L. Beyond without forgetting: multi-task learning for classification with disjoint datasets. arXiv. 2020;2003:1-10. Available from: https://arxiv.org/abs/2003.06746
  36. 36. Kim C, Kim E. Multitask learning with heterogeneous tasks. In: 2022 13th International Conference on Information and Communication Technology Convergence (ICTC). Piscataway, NJ: IEEE; 2022. pp. 1024-1026
  37. 37. Li Z, Hoiem D. Learning without forgetting. arXiv. 2017;1606:1-11. Available from: https://arxiv.org/abs/1606.09282
