Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

Parthasaarathy Sudarsanam; Irene Martín-Morató; Tuomas Virtanen

arXiv:2505.14562·cs.SD·May 21, 2025

Open Access

TL;DR

This paper introduces a single-stage contrastive learning method for aligning audio, visual, and text modalities simultaneously, outperforming traditional two-stage approaches in multimodal representation learning.

Contribution

It presents a novel unified training framework that improves multimodal alignment by jointly optimizing all three modalities in a single stage.

Findings

01 Two-fold improvement in audio-based visual retrieval

02 Single-stage training outperforms the two-stage method

03 Effective use of the AVCaps dataset for trimodal learning

Abstract

This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all three modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval, highlighting…
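
The abstract does not state the exact form of the loss, so the following is only a minimal sketch of what single-stage trimodal contrastive training could look like. It assumes a symmetric InfoNCE loss summed over the three modality pairs (audio-text, visual-text, audio-visual), embeddings already produced by some encoders, and a temperature of 0.07. The function names, the pairwise sum, and the temperature are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    Row i of `a` and row i of `b` form the positive pair; every other
    row in the batch acts as a negative. The temperature is an assumed value.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # a -> b retrieval
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)


def trimodal_loss(audio, visual, text, temperature=0.07):
    """Hypothetical joint objective: sum of the three pairwise losses.

    All three modality pairs are optimized together in one stage, rather
    than aligning visual-text and audio-text in separate stages.
    """
    return (
        symmetric_info_nce(audio, text, temperature)
        + symmetric_info_nce(visual, text, temperature)
        + symmetric_info_nce(audio, visual, temperature)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings: 8 clips, 16 dimensions per modality.
    audio = rng.normal(size=(8, 16))
    visual = audio + 0.1 * rng.normal(size=(8, 16))
    text = audio + 0.1 * rng.normal(size=(8, 16))
    print("trimodal loss:", trimodal_loss(audio, visual, text))
```

In this sketch, a batch whose three embeddings agree per clip yields a lower loss than one where a modality is shuffled across clips, which is the behavior a joint alignment objective is meant to reward.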

Taxonomy

Topics: Music and Audio Processing