Semantic visually-guided acoustic highlighting with large vision-language models
Junhua Huang, Chao Huang, Chenliang Xu

TL;DR
This paper investigates how large vision-language models can improve audio remixing by identifying key visual cues like focus and tone that enhance perceptual audio-visual alignment in storytelling.
Contribution
The paper systematically evaluates visual-semantic cues extracted by large vision-language models to determine their impact on automated audio remixing, and identifies the cues most useful for cinema-grade sound design.
Findings
Camera focus, tone, and scene background consistently improve audio-visual coherence.
Visual cues can be effectively extracted using large vision-language models.
The study provides a practical approach for automating high-quality sound design.
Abstract
Balancing dialogue, music, and sound effects with accompanying video is crucial for immersive storytelling, yet current audio mixing workflows remain largely manual and labor-intensive. While recent advancements have introduced the visually guided acoustic highlighting task, which implicitly rebalances audio sources using multimodal guidance, it remains unclear which visual aspects are most effective as conditioning signals. We address this gap through a systematic study of whether deep video understanding improves audio remixing. Using textual descriptions as a proxy for visual analysis, we prompt large vision-language models to extract six types of visual-semantic aspects, including object and character appearance, emotion, camera focus, tone, scene background, and inferred sound-related cues. Through extensive experiments, camera focus, tone, and scene background consistently yield…
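To make the prompting step in the abstract concrete, here is a minimal sketch of how one prompt per visual-semantic aspect might be assembled. The six aspect names follow the abstract; the prompt wording and the helpers `build_prompts` and `merge_descriptions` are hypothetical illustrations, not the authors' prompts or code, and no particular vision-language model API is assumed.

```python
# Hypothetical sketch: one text prompt per visual-semantic aspect.
# Aspect names follow the abstract; the prompt wording is an assumption.
ASPECTS = {
    "object_and_character_appearance": "Describe the visible objects and characters and how they look.",
    "emotion": "Describe the emotions conveyed in the clip.",
    "camera_focus": "Describe what the camera is focused on.",
    "tone": "Describe the overall tone of the scene.",
    "scene_background": "Describe the background and setting of the scene.",
    "sound_related_cues": "Describe the sounds you would expect to hear, inferred from the visuals.",
}


def build_prompts(selected=None):
    """Return {aspect_name: prompt} for the selected aspects (all six by default)."""
    names = list(ASPECTS) if selected is None else list(selected)
    unknown = [name for name in names if name not in ASPECTS]
    if unknown:
        raise KeyError(f"unknown aspects: {unknown}")
    return {name: f"{ASPECTS[name]} Answer in one or two sentences." for name in names}


def merge_descriptions(descriptions):
    """Join per-aspect model answers into a single conditioning text."""
    return " ".join(f"[{name}] {text.strip()}" for name, text in descriptions.items())
```

Each prompt would be sent to a vision-language model together with the video frames, and the returned descriptions could then be merged into the text that conditions the remixing model. Restricting `selected` to, say, `["camera_focus", "tone", "scene_background"]` mirrors the kind of per-aspect comparison the study describes.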
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
