YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, and Lei Xie

TL;DR
YingSound is a foundation model that generates high-quality, synchronized sound effects for videos in few-shot settings by aligning audio-visual features and employing a multi-modal chain-of-thought approach.
Contribution
The paper introduces YingSound, a novel multi-modal model with a chain-of-thought mechanism for video-guided sound generation in low-data scenarios.
Findings
Achieves effective semantic alignment between audio and visual features
Generates high-quality, synchronized sounds across diverse scenarios
Results are validated by automated and human evaluations
Abstract
Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT)…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Data Visualization and Analytics
