Can Sound Replace Vision in LLaVA With Token Substitution?

Ali Vosoughi; Jing Bi; Pinxin Liu; Yunlong Tang; Chenliang Xu

arXiv:2506.10416 · cs.MM · August 7, 2025

TL;DR

This paper investigates the effects of extreme audio-visual alignment on perceptual models by creating a detailed dataset and analyzing how different encoder architectures respond to realignment in the CLIP space.

Contribution

It introduces a new dataset with granular alignment scores and systematically studies how image-centric and text-centric encoders behave under superaligned audio-visual representations.

Findings

01. Image-centric encoders excel at cross-modal retrieval but lose linguistic detail after alignment.

02. Text-centric encoders better preserve linguistic information during alignment.

03. The effect of alignment on an encoder depends on its architectural design.
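
As a rough illustration of what "cross-modal retrieval" measures in the first finding, the sketch below computes recall@k for audio-to-image retrieval in a shared embedding space using cosine similarity. This is not the paper's evaluation code: the function names, the toy 2-D embeddings, and the convention that the audio at index i is paired with the image at index i are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(audio_embs, image_embs, k=1):
    """Fraction of audio queries whose paired image ranks in the top k.

    Assumption for this sketch: audio_embs[i] is paired with image_embs[i].
    """
    hits = 0
    for i, query in enumerate(audio_embs):
        # Rank all image indices by similarity to the audio query, best first.
        ranked = sorted(
            range(len(image_embs)),
            key=lambda j: cosine(query, image_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(audio_embs)


# Toy embeddings (hypothetical); the third audio clip is closer to the
# second image than to its own pair, so it is a miss at k=1.
audio = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
image = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.1]]

r1 = recall_at_k(audio, image, k=1)
r2 = recall_at_k(audio, image, k=2)
print(f"recall@1={r1:.3f} recall@2={r2:.3f}")
```

On this toy data two of the three queries retrieve their pair first, and all three do within the top two. Real embeddings would come from the audio and image encoders under study rather than hand-written vectors.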

Abstract

What happens when we push audio-visual alignment to its absolute limits? To systematically investigate this question, we needed datasets with granular alignment quality annotations, but existing datasets treat alignment as binary, either synchronized or not. To address this limitation, we developed a comprehensive dataset featuring detailed alignment scores that reveal the hidden spectrum of audio-visual perceptual correspondence. Using these precise scores, we create "superaligned" representations by training exclusively on the most perfectly matched audio-visual pairs, then conduct our systematic investigation into how this extreme alignment transforms perceptual model behavior across retrieval and generation tasks. The encoders under study fall into two main groups consisting of image-centric encoders that were pretrained using visual modalities as intermediary hubs for connecting…
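
The selection step described in the abstract, training only on the best-matched audio-visual pairs, can be sketched as a filter over per-pair alignment scores. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `AVPair` record, its field names, the score scale, and the `keep_fraction` parameter are hypothetical, and the page does not say how the cut-off is actually chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVPair:
    clip_id: str            # hypothetical identifier for an audio-visual clip
    alignment_score: float  # hypothetical scale: higher means better matched


def select_top_aligned(pairs, keep_fraction=0.1):
    """Return the best-aligned fraction of pairs, highest score first."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(pairs, key=lambda p: p.alignment_score, reverse=True)
    if not ranked:
        return []
    # Always keep at least one pair so the training subset is never empty.
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy records (hypothetical scores).
pairs = [
    AVPair("a", 9.5),
    AVPair("b", 3.0),
    AVPair("c", 7.2),
    AVPair("d", 8.8),
]

subset = select_top_aligned(pairs, keep_fraction=0.5)
print([p.clip_id for p in subset])
```

A fixed score threshold would work equally well in place of a kept fraction; the fraction is used here only because it does not require knowing the score scale.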

Code & Models

Datasets

ali-vosoughi/ave-2
dataset · 102 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Multisensory perception and integration · Music and Audio Processing