Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning
Cheng-Hao Tu, Zheda Mai, Wei-Lun Chao

TL;DR
This paper introduces Visual Query Tuning (VQT), a memory-efficient method for leveraging intermediate features of Vision Transformers to improve transfer learning accuracy without full model fine-tuning.
Contribution
VQT is a novel approach that summarizes intermediate features with learnable queries, enhancing transfer learning performance while maintaining memory efficiency.
Findings
VQT outperforms state-of-the-art methods that also use intermediate features.
VQT surpasses full fine-tuning in many scenarios.
VQT complements existing parameter-efficient methods for improved accuracy.
Abstract
Intermediate features of a pre-trained model have been shown to be informative for making accurate predictions on downstream tasks, even if the model backbone is kept frozen. The key challenge is how to utilize these intermediate features given their gigantic amount. We propose visual query tuning (VQT), a simple yet effective approach to aggregate intermediate features of Vision Transformers. By introducing a handful of learnable "query" tokens to each layer, VQT leverages the inner workings of Transformers to "summarize" rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. As VQT keeps the intermediate features intact and only learns to combine them, it enjoys memory efficiency in training, compared to many other parameter-efficient fine-tuning approaches that learn to adapt features and need back-propagation through…
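The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a toy single-head attention layer with no MLP or layer normalization, and every name in it (`frozen_attention`, `summarize_layers`, the shapes, the random weights) is hypothetical. The point it shows is that the query tokens only read from each layer's features: the frozen token stream is computed without ever seeing them, so the intermediate features stay intact.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frozen_attention(q_in, kv_in, w_q, w_k, w_v):
    # Toy single-head attention using frozen projection weights (assumed form).
    q = q_in @ w_q
    k = kv_in @ w_k
    v = kv_in @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def summarize_layers(tokens, layers, queries):
    """Run the frozen layers and collect one query summary per layer."""
    summaries = []
    x = tokens
    for (w_q, w_k, w_v), p in zip(layers, queries):
        # Query tokens attend to this layer's token features; keys and
        # values come only from x, so the queries never alter the features.
        summaries.append(frozen_attention(p, x, w_q, w_k, w_v))
        # The original tokens are updated exactly as in the frozen model.
        x = x + frozen_attention(x, x, w_q, w_k, w_v)
    # Concatenated summaries would feed a downstream prediction head.
    return x, np.concatenate([s.ravel() for s in summaries])

rng = np.random.default_rng(0)
d, n_tokens, n_layers, n_queries = 8, 5, 3, 2
tokens = rng.normal(size=(n_tokens, d))
layers = [tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
          for _ in range(n_layers)]
queries = [rng.normal(size=(n_queries, d)) for _ in range(n_layers)]  # the only trainable part

final_tokens, summary = summarize_layers(tokens, layers, queries)
```

In this toy setting, swapping in different query tokens changes `summary` but leaves `final_tokens` untouched, which is the property that would let training skip back-propagation through the backbone's own features.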
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
