Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

TL;DR
This paper introduces a single-stage contrastive learning method for aligning audio, visual, and text modalities simultaneously, outperforming traditional two-stage approaches in multimodal representation learning.
Contribution
It presents a novel unified training framework that improves multimodal alignment by jointly optimizing all three modalities in a single stage.
Findings
Two-fold improvement in audio-based visual retrieval
Single-stage training outperforms two-stage methods
Effective use of AVCaps dataset for multimodal learning
Abstract
This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all the modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval, highlighting…
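As a rough illustration of what jointly optimizing three modalities with a contrastive objective can look like, the sketch below sums a symmetric InfoNCE-style loss over the audio-text, visual-text, and audio-visual pairs of a batch. It is not the authors' implementation: the abstract does not specify the loss, so the choice of pairs, the equal weighting, the temperature value of 0.07, and all function names here are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] within a batch.

    Every other key in the batch acts as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_loss(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x contrastive losses."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def trimodal_loss(audio, visual, text, temperature=0.07):
    """Hypothetical single-stage objective: sum over all three modality pairs."""
    return (
        symmetric_loss(audio, text, temperature)
        + symmetric_loss(visual, text, temperature)
        + symmetric_loss(audio, visual, temperature)
    )
```

In this sketch, a batch whose audio, visual, and text embeddings agree clip by clip yields a loss near zero, while a batch in which one modality is permuted relative to the others yields a large loss. A two-stage scheme would instead minimize the visual-text and audio-text terms in separate training runs, typically on different datasets.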
Taxonomy
Topics: Music and Audio Processing
