Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation

Jinxing Zhou; Zhihui Li; Yongqiang Yu; Yanghao Zhou; Ruohao Guo; Guangyao Li; Yuxin Mao; Mingfei Han; Xiaojun Chang; Meng Wang

arXiv:2506.23271 · cs.CV · July 1, 2025

Open Access

TL;DR

Mettle introduces a memory-efficient, meta-token learning approach for adapting large-scale transformer models to audio-visual tasks, enabling effective task-specific adaptation with reduced resource requirements.

Contribution

The paper proposes a framework that combines Layer-Centric Distillation (LCD) with Meta-Token Injection (MTI) for efficient audio-visual model adaptation, reducing memory usage and training time.

Findings

01 Reduces memory consumption and training time significantly.

02 Maintains competitive accuracy on multiple benchmarks.

03 Supports both classification and segmentation tasks.

Abstract

We present Meta-Token Learning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a Meta-Token Injection (MTI) module, which utilizes…
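
To make the idea of per-layer distillation concrete, the sketch below pools each layer's token features into a few meta-tokens, with every layer handled independently of the others. It is only an illustration under stated assumptions: the abstract does not say how LCD computes meta-tokens, so the dot-product attention pooling used here is a stand-in, and all names (`distill_layer`, `distill_all_layers`, `layer_queries`) and sizes are hypothetical. The random features stand in for the outputs of a frozen backbone.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distill_layer(tokens, queries):
    """Pool N token vectors into len(queries) meta-tokens.

    ASSUMPTION: scaled dot-product attention pooling stands in for the
    paper's LCD module, whose internals the abstract does not describe.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    meta_tokens = []
    for query in queries:
        # One attention weight per token for this query.
        weights = softmax(
            [scale * sum(q * t for q, t in zip(query, token)) for token in tokens]
        )
        # The meta-token is the attention-weighted average of the tokens.
        meta_tokens.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return meta_tokens


def distill_all_layers(layer_features, layer_queries):
    """Distill every layer on its own; no layer depends on another's result."""
    return [distill_layer(f, q) for f, q in zip(layer_features, layer_queries)]


# Hypothetical sizes: 3 layers, 10 tokens per layer, 4 dims, 2 meta-tokens per layer.
rng = random.Random(0)
num_layers, num_tokens, dim, num_meta = 3, 10, 4, 2
layer_features = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_layers)
]
layer_queries = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_meta)]
    for _ in range(num_layers)
]

meta_tokens = distill_all_layers(layer_features, layer_queries)
print(len(meta_tokens), len(meta_tokens[0]), len(meta_tokens[0][0]))  # prints: 3 2 4
```

In this toy version the queries are random; in a trainable setting they would be learned parameters, and the per-layer loop could run as a single batched operation in a tensor library because the layers are processed independently.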

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications