MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

TL;DR
This paper introduces MA-AVT, a parameter-efficient audio-visual transformer that combines joint unimodal and multimodal token learning, blockwise contrastive alignment of hierarchical features, and foreground mining to improve multimodal learning performance.
Contribution
It proposes a novel modality alignment method with joint token learning, blockwise contrastive learning, and foreground mining for enhanced audio-visual transformer performance.
Findings
Achieves state-of-the-art results on AVE, VGGSound, and CREMA-D datasets.
Demonstrates the effectiveness of deep hierarchical feature alignment.
Shows significant performance improvements over existing methods.
Abstract
Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align…
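The blockwise contrastive idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each transformer block yields one pooled feature vector per sample and per modality, applies a standard symmetric InfoNCE loss at every block, and averages the per-block losses. The function names, the temperature of 0.07, and the pooling assumption are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(audio, visual, temperature=0.07):
    # Symmetric InfoNCE over a batch; row i of `audio` and row i of
    # `visual` are assumed to come from the same clip (the positive pair).
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature            # shape: (batch, batch)
    idx = np.arange(len(a))
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2v + loss_v2a)


def blockwise_contrastive_loss(audio_blocks, visual_blocks, temperature=0.07):
    # Assumption: one (batch, dim) array of pooled features per block and
    # per modality. The per-block losses are simply averaged here.
    losses = [
        info_nce(a, v, temperature)
        for a, v in zip(audio_blocks, visual_blocks)
    ]
    return float(np.mean(losses))


# Toy usage: three "blocks" of features for a batch of 8 clips.
rng = np.random.default_rng(0)
audio_blocks = [rng.normal(size=(8, 16)) for _ in range(3)]
# Aligned visual features: close to the audio features of the same clip.
aligned_blocks = [a + 0.01 * rng.normal(size=a.shape) for a in audio_blocks]
# Misaligned visual features: the same vectors paired with the wrong clip.
shuffled_blocks = [np.roll(v, 1, axis=0) for v in aligned_blocks]

aligned_loss = blockwise_contrastive_loss(audio_blocks, aligned_blocks)
shuffled_loss = blockwise_contrastive_loss(audio_blocks, shuffled_blocks)
```

In this toy setup the loss is lower when matching audio and visual features share a row index than when the pairing is shifted, which is the behaviour a contrastive alignment objective is meant to reward at every depth rather than only at the encoder output.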
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN · Contrastive Learning
