Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

Yong Ren; Chenxing Li; Le Xu; Hao Gu; Duzhen Zhang; Yujie Chen; Manjie Xu; Ruibo Fu; Shan Yang; Dong Yu

arXiv:2505.13062 · cs.MM · May 29, 2025

Open Access

TL;DR

This paper explores whether multimodal large language models can infer sounds from silent videos, introducing a new task and dataset, and demonstrating improved reasoning capabilities through Chain-of-Thought fine-tuning.

Contribution

The paper introduces the SVAD task, constructs the CoT-AudioCaps dataset, and proposes a Chain-of-Thought fine-tuning method to enhance VLMs' modal-mismatch reasoning.

Findings

01 Significant improvement in modal-mismatch reasoning for SVAD

02 Effective acquisition of audio descriptions during inference

03 Enhanced performance on VT2A tasks

Abstract

Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the…

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media