TL;DR
This paper introduces a novel multi-grained alignment method for contrastive language-audio pre-training, enhancing both coarse and fine-grained cross-modal understanding by unifying representations with a shared codebook and locality-aware techniques.
Contribution
It proposes a shared codebook and locality-aware block to improve multi-grained alignment, addressing limitations of previous models like CLAP.
Findings
Outperforms baseline CLAP significantly on multiple tasks
Achieves superior or competitive results compared to state-of-the-art methods
Enhances fine-grained and coarse-grained cross-modal alignment
Abstract
Recent audio-language joint learning models such as CLAP have shown considerable success on multi-modal understanding tasks. These models usually aggregate uni-modal local representations, namely frame or word features, into global ones, on which a contrastive loss is applied to reach coarse-grained cross-modal alignment. However, frame-level correspondence with text may be ignored, which leaves these models poorly suited to explainability and fine-grained tasks and may also undermine performance on coarse-grained tasks. In this work, we aim to improve both coarse- and fine-grained audio-language alignment in large-scale contrastive pre-training. To unify the granularity and latent distribution of the two modalities, a shared codebook is adopted to represent multi-modal global features with common bases, and each codeword is regularized to encode modality-shared semantics,…
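The coarse-grained baseline the abstract describes (local features aggregated into a global vector, then a contrastive loss across modalities) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: mean pooling as the aggregation step, the temperature value of 0.07, and all function names are assumptions made for the example, and the shared codebook and locality-aware components are not modeled.

```python
import math

def mean_pool(features):
    """Average local (frame or word) vectors into one global vector.
    Mean pooling is an assumed aggregation, chosen for simplicity."""
    dim = len(features[0])
    return [sum(vec[i] for vec in features) / len(features) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_batch, text_batch, temperature=0.07):
    """Symmetric contrastive loss over pooled audio and text features.
    audio_batch[i] and text_batch[i] are treated as a matching pair;
    the temperature default is an assumed, illustrative value."""
    audio_global = [l2_normalize(mean_pool(a)) for a in audio_batch]
    text_global = [l2_normalize(mean_pool(t)) for t in text_batch]
    # Cosine similarity between every audio clip and every caption.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text_global]
        for a in audio_global
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    audio_to_text = cross_entropy_diagonal(logits)
    text_to_audio = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (audio_to_text + text_to_audio)

# Toy batch: two clips (two frames each) and two captions (one word each).
audio = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
text_matched = [[[1.0, 0.0]], [[0.0, 1.0]]]
text_swapped = [[[0.0, 1.0]], [[1.0, 0.0]]]

loss_matched = symmetric_contrastive_loss(audio, text_matched)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
```

In this toy batch the correctly paired captions give a much lower loss than the swapped ones. Because the loss only sees the pooled vectors, nothing in it ties individual frames to individual words, which is the gap the abstract points to.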
