Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

Xilin Jiang; Junkai Wu; Vishal Choudhari; Nima Mesgarani

arXiv:2505.06803 · cs.SD · May 13, 2025

Open Access

TL;DR

This paper evaluates audio, visual, and multimodal large language models against humans in sound recognition tasks, revealing sensory gaps and proposing cross-modal distillation to improve modality-specific perception.

Contribution

It systematically compares LLMs across modalities and introduces a cross-modal distillation framework to reduce sensory gaps, aligning models more closely with human perception.

Findings

01. The performance gap between audio and visual LLMs parallels the sensory discrepancy between human ears and eyes.

02. Cross-modal distillation improves sound recognition, especially in challenging classes.

03. The method enhances modality-specific perception in multimodal LLMs.

Abstract

Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a…
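
The teacher/student idea in the abstract can be illustrated with a minimal sketch. It uses standard temperature-scaled logit distillation restricted to a set of challenging classes, which is a swapped-in textbook technique and not necessarily the paper's procedure. The function names, the `hard_classes` set, the class labels, and the logits below are all hypothetical; how the paper selects challenging classes is not covered here, so the set is simply passed in.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def masked_distillation_loss(teacher_logits, student_logits, label,
                             hard_classes, temperature=2.0):
    """Distillation loss applied only to clips whose class is in hard_classes.

    Assumption: hard_classes is a precomputed set of labels on which the
    student modality is expected to struggle. Clips outside it contribute 0.
    """
    if label not in hard_classes:
        return 0.0
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor is the usual scaling in temperature-based distillation.
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


# Hypothetical example: a visual teacher is confident about class index 1,
# while an audio student is unsure between classes 1 and 2.
hard_classes = {"keyboard"}
teacher = [0.1, 4.0, 0.5]
student = [0.2, 1.0, 1.1]

loss_hard = masked_distillation_loss(teacher, student, "keyboard", hard_classes)
loss_easy = masked_distillation_loss(teacher, student, "dog", hard_classes)
loss_same = masked_distillation_loss(teacher, teacher, "keyboard", hard_classes)
```

In this sketch, masking by class keeps the student's behavior unchanged on classes where it is already reliable and pulls its predictions toward the teacher's only where the student is expected to be weak.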

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Domain Adaptation and Few-Shot Learning