MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Tanvir Mahmud; Shentong Mo; Yapeng Tian; Diana Marculescu

arXiv:2406.04930·cs.CV·June 10, 2024·1 cites

TL;DR

This paper introduces MA-AVT, a parameter-efficient audio-visual transformer that combines joint unimodal and multimodal token learning, blockwise contrastive alignment of hierarchical features, and foreground mining to improve multimodal learning performance.

Contribution

It proposes a novel modality alignment method with joint token learning, blockwise contrastive learning, and foreground mining for enhanced audio-visual transformer performance.

Findings

01 Achieves state-of-the-art results on AVE, VGGSound, and CREMA-D datasets.

02 Demonstrates the effectiveness of deep hierarchical feature alignment.

03 Shows significant performance improvements over existing methods.

Abstract

Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align…
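To make the idea of blockwise contrastive alignment more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract is truncated here, so the exact features being aligned are not specified. The sketch assumes one pooled feature vector per sample, per modality, after each transformer block, and uses a symmetric InfoNCE loss (a common contrastive objective) averaged over blocks. The function names, the data layout, and the temperature value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy where the i-th anchor should match the i-th target."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every target in the batch.
        logits = [
            sum(x * y for x, y in zip(anchor, target)) / temperature
            for target in targets
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def blockwise_contrastive_loss(audio_blocks, visual_blocks, temperature=0.07):
    """Average a symmetric InfoNCE loss over per-block pooled features.

    Assumed (hypothetical) layout: audio_blocks[b][i] and visual_blocks[b][i]
    are the pooled feature vectors of sample i after transformer block b.
    """
    losses = []
    for audio, visual in zip(audio_blocks, visual_blocks):
        a = [l2_normalize(vec) for vec in audio]
        v = [l2_normalize(vec) for vec in visual]
        audio_to_visual = info_nce(a, v, temperature)
        visual_to_audio = info_nce(v, a, temperature)
        losses.append(0.5 * (audio_to_visual + visual_to_audio))
    return sum(losses) / len(losses)


# Toy check: one block, two samples, two-dimensional features.
aligned = [[[1.0, 0.0], [0.0, 1.0]]]
swapped = [[[0.0, 1.0], [1.0, 0.0]]]
loss_aligned = blockwise_contrastive_loss(aligned, aligned)
loss_swapped = blockwise_contrastive_loss(aligned, swapped)
```

In this toy check, matching audio and visual features give a loss close to zero, while swapping the pairing gives a much larger loss. Summing such a term over every block, rather than only at the encoder output, is what distinguishes a blockwise scheme from aligning coarse output features alone.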

Code & Models

Repositories

enyac-group/MA-AVT (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies

Methods: ALIGN · Contrastive Learning