TL;DR
This paper introduces Masked Dual Path Distillation (MDPD), a transfer learning method that speeds up inference by discarding the side network after training, while keeping fine-tuning parameter- and memory-efficient and improving accuracy on vision and language tasks.
Contribution
It proposes a framework in which the frozen backbone and the side network are mutually distilled during fine-tuning, after which the side network is discarded so that inference runs on the backbone alone.
Findings
Accelerates inference by at least 25.2%
Maintains parameter and memory efficiency during fine-tuning
Improves accuracy over state-of-the-art methods
Abstract
Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing both the number of trainable parameters and the memory consumed during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address this issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and…
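To make the idea of mutual distillation concrete, here is a minimal sketch of a generic symmetric distillation loss between two prediction paths. It is an illustration only, not the paper's objective: the abstract is truncated, so the actual loss, the masking scheme, and which parameters receive gradients are not shown here. The function names, the temperature value, and the choice of symmetric KL divergence are all assumptions.

```python
import math
from typing import List

EPS = 1e-12  # guards against log(0) if a probability underflows


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi) - math.log(max(qi, EPS)))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def mutual_distillation_loss(
    backbone_logits: List[float],
    side_logits: List[float],
    temperature: float = 2.0,  # assumed value, not taken from the paper
) -> float:
    """Symmetric KL between the two paths: each path acts as teacher for the other."""
    p = softmax(backbone_logits, temperature)
    q = softmax(side_logits, temperature)
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical logits for one example from the two paths.
loss = mutual_distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8])
```

The loss is zero when both paths produce identical logits and grows as their predictions diverge. In a setup like the one the abstract describes, a term of this kind would be used only during fine-tuning; at inference only the backbone path would be evaluated, which is where the speedup from dropping the side network would come from.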
