Semantic visually-guided acoustic highlighting with large vision-language models

Junhua Huang; Chao Huang; Chenliang Xu

arXiv:2601.08871 · cs.SD · January 15, 2026

TL;DR

This paper investigates whether large vision-language models can improve audio remixing, identifying the visual cues, such as camera focus and tone, that best support perceptual audio-visual alignment in storytelling.

Contribution

The paper systematically evaluates visual-semantic cues extracted by large vision-language models to determine their effect on automated audio remixing, and highlights the cues most useful in practice for cinema-grade sound design.

Findings

01. Camera focus, tone, and scene background significantly improve audio-visual coherence.

02. Visual cues can be effectively extracted using large vision-language models.

03. The study provides a practical approach for automating high-quality sound design.

Abstract

Balancing dialogue, music, and sound effects with accompanying video is crucial for immersive storytelling, yet current audio mixing workflows remain largely manual and labor-intensive. While recent advancements have introduced the visually guided acoustic highlighting task, which implicitly rebalances audio sources using multimodal guidance, it remains unclear which visual aspects are most effective as conditioning signals. We address this gap through a systematic study of whether deep video understanding improves audio remixing. Using textual descriptions as a proxy for visual analysis, we prompt large vision-language models to extract six types of visual-semantic aspects, including object and character appearance, emotion, camera focus, tone, scene background, and inferred sound-related cues. Through extensive experiments, camera focus, tone, and scene background consistently yield…
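
To make the prompting step in the abstract more concrete, the following is a minimal sketch of how one text prompt per visual-semantic aspect might be assembled before being sent to a vision-language model. The aspect names come from the abstract; the grouping into exactly these six keys is one reading of that list, and the dictionary keys, the function names `build_prompt` and `build_all_prompts`, and the prompt wording are all hypothetical and not taken from the paper. No model is called here.

```python
# Hypothetical sketch: keys, function names, and prompt wording are
# illustrative assumptions, not the authors' implementation.

# The six aspect types listed in the abstract, keyed by made-up identifiers.
ASPECTS = {
    "appearance": "object and character appearance",
    "emotion": "emotion",
    "camera_focus": "camera focus",
    "tone": "tone",
    "scene_background": "scene background",
    "sound_cues": "inferred sound-related cues",
}


def build_prompt(aspect_key: str) -> str:
    """Return a text prompt asking for a description of one aspect."""
    description = ASPECTS[aspect_key]
    return (
        "Describe the " + description
        + " in this video clip in one or two sentences."
    )


def build_all_prompts() -> dict[str, str]:
    """Return one prompt per aspect, keyed by aspect identifier."""
    return {key: build_prompt(key) for key in ASPECTS}


if __name__ == "__main__":
    for key, prompt in build_all_prompts().items():
        print(f"{key}: {prompt}")
```

Keeping one prompt per aspect, rather than a single combined prompt, would make it straightforward to condition a remixing model on each aspect separately and compare them, which is the kind of per-aspect comparison the abstract describes.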

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing