Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Yiming Li; Zhifang Guo; Xiangdong Wang; Hong Liu

arXiv:2408.07919 · eess.AS · August 16, 2024 · ACM Multimedia

TL;DR

This paper introduces a novel multi-grained alignment method for contrastive language-audio pre-training, enhancing both coarse and fine-grained cross-modal understanding by unifying representations with a shared codebook and locality-aware techniques.

Contribution

The paper proposes a shared codebook and a locality-aware block to improve multi-grained alignment, addressing limitations of earlier models such as CLAP.

Findings

01 Outperforms baseline CLAP significantly on multiple tasks

02 Achieves superior or competitive results compared to state-of-the-art methods

03 Enhances fine-grained and coarse-grained cross-modal alignment

Abstract

Recent advances in audio-language joint learning, such as CLAP, have shown much success in multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which a contrastive loss is applied to reach coarse-grained cross-modal alignment. However, frame-level correspondence with text may be ignored, which leaves the models poorly suited to explainability and fine-grained tasks and may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics,…
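The coarse-grained baseline the abstract describes (pool local frame or word features into a global vector, then apply a contrastive loss across a batch) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: mean pooling, the temperature value of 0.07, and all function names are assumptions chosen for the example, and the shared codebook and locality-aware block are not modeled here.

```python
import math

def mean_pool(local_feats):
    """Aggregate local (frame or word) vectors into one global vector.
    Mean pooling is an assumption; the paper's aggregation may differ."""
    n = len(local_feats)
    dim = len(local_feats[0])
    return [sum(v[d] for v in local_feats) / n for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_batch, text_batch, temperature=0.07):
    """CLAP-style symmetric loss on pooled global features.
    audio_batch[i] and text_batch[i] are assumed to be a matched pair.
    temperature=0.07 is a hypothetical default, not taken from the paper."""
    a = [l2_normalize(mean_pool(x)) for x in audio_batch]
    t = [l2_normalize(mean_pool(x)) for x in text_batch]
    # Cosine similarity matrix scaled by temperature: rows audio, columns text.
    sim = [[sum(p * q for p, q in zip(ai, tj)) / temperature for tj in t]
           for ai in a]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))

# Toy usage: two clips with two frames each, two captions with one word each.
audio = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
text = [[[1.0, 0.1]], [[0.1, 1.0]]]
loss_matched = symmetric_contrastive_loss(audio, text)
loss_swapped = symmetric_contrastive_loss(audio, text[::-1])
```

Because pooling collapses every frame into a single vector before the loss is computed, nothing in this objective ties an individual frame to the text, which is the gap in frame-level correspondence that the abstract points to.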

Code & Models

Repositories

ming-er/mga-clap (PyTorch, official)
