VeS: Teaching Pixels to Listen Without Supervision

Sajay Raj

arXiv:2507.22008·cs.CV·July 30, 2025

TL;DR

This paper demonstrates that dense token routing significantly improves multilingual audio-visual retrieval and localization in low-resource, noisy settings, even with a frozen vision backbone.

Contribution

It applies a dense contrastive learning objective to multilingual AV tasks and compares it against global pooling and a hybrid, showing substantial gains over global pooling in low-resource scenarios.

Findings

01

The dense objective improves audio-visual R@1 by 59% relative to global pooling.

02

Sharp zero-shot localization heatmaps are produced despite a frozen vision backbone.

03

Dense token routing is especially effective in low-resource, noisy environments.

Abstract

Recent dense audio-visual (AV) models achieve impressive retrieval and emergent localization, but almost all evidence comes from English-centric, caption-rich web video. It is unclear whether these objectives survive in low-resource, code-switched, and noisy multilingual settings that typify developing regions. We show they do, and that the choice of aggregation function becomes even more critical. Using a multilingual subset of Project Vaani spanning dozens of Indian languages and dialectal variants, we compare three contrastive objectives: (i) a global mean-pooled loss (CLIP-style), (ii) a dense max-mean token matcher (DenseAV-style), and (iii) a simple hybrid (motivated by frozen-vision alignment strategies). The dense objective delivers a +59% relative R@1 (Audio Visual) improvement over global pooling and substantially lower mean/median ranks, while consistently producing sharp…
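
To make the difference between objectives (i) and (ii) concrete, here is a minimal NumPy sketch of the two aggregation functions and a symmetric InfoNCE loss over a batch similarity matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of cosine similarity, the max-over-patches then mean-over-audio-tokens ordering, and the temperature value are all assumptions made for this example. The hybrid objective (iii) is omitted because the abstract does not specify its form.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_similarity(audio_tokens, visual_tokens):
    """CLIP-style score (assumed form): mean-pool each modality, then cosine.

    audio_tokens:  (T, D) array of audio token embeddings
    visual_tokens: (P, D) array of visual patch embeddings
    """
    a = l2_normalize(audio_tokens.mean(axis=0))
    v = l2_normalize(visual_tokens.mean(axis=0))
    return float(a @ v)


def dense_similarity(audio_tokens, visual_tokens):
    """Dense max-mean score (assumed form): for each audio token take its
    best-matching patch, then average those maxima over audio tokens."""
    a = l2_normalize(audio_tokens)
    v = l2_normalize(visual_tokens)
    token_patch = a @ v.T                  # (T, P) cosine similarities
    return float(token_patch.max(axis=1).mean())


def info_nce(sim_matrix, temperature=0.07):
    """Symmetric InfoNCE over a (B, B) similarity matrix whose diagonal
    holds the matched audio-visual pairs. The temperature is hypothetical."""
    def cross_entropy_on_diagonal(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    logits = sim_matrix / temperature
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Toy example: one audio token that matches exactly one of two patches.
audio = np.array([[1.0, 0.0]])
patches = np.array([[1.0, 0.0], [0.0, 1.0]])
dense_score = dense_similarity(audio, patches)
global_score = global_similarity(audio, patches)
```

In this toy case the dense score stays near 1 because the matching patch is selected by the max, whereas mean pooling blends the matching and non-matching patches and lowers the global score. The intermediate `token_patch` matrix is also the kind of quantity from which a per-patch localization heatmap could be read, although how the paper derives its heatmaps is not stated in the abstract.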
