YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian, Yihua Wang, Chaofan Ding, and Lei Xie

arXiv:2412.09168 · cs.SD · December 13, 2024

Open Access

TL;DR

YingSound is a foundation model that generates high-quality, synchronized sound effects for videos in few-shot settings by aligning audio-visual features and employing a multi-modal chain-of-thought approach.

Contribution

The paper introduces YingSound, a novel multi-modal model with a chain-of-thought mechanism for video-guided sound generation in low-data scenarios.

Findings

1. Effective semantic alignment of audio-visual features achieved
2. High-quality synchronized sounds generated across diverse scenarios
3. Validated by automated and human evaluations

Abstract

Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT)…
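
The abstract names a conditional flow matching objective but does not say which probability path or network is used. As a rough illustration only, the sketch below shows the commonly used linear-path variant: a noise sample is interpolated toward a data sample, and a model is trained to predict the constant velocity between them. The function names (`cfm_pair`, `cfm_loss`), the `predict_velocity` callable, the `cond` argument standing in for visual features, and the flat `(batch, dim)` audio latent shape are all hypothetical and are not taken from the paper.

```python
import numpy as np

def cfm_pair(x1, rng):
    """Sample a training triple (x_t, t, u) on a straight noise-to-data path.

    x1: array of shape (batch, dim), standing in for audio latents (assumed shape).
    """
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))   # one time value per batch item
    x_t = (1.0 - t) * x0 + t * x1            # point on the line from x0 to x1
    u = x1 - x0                              # target velocity, constant along the line
    return x_t, t, u

def cfm_loss(predict_velocity, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity: hypothetical callable (x_t, t, cond) -> array shaped like x_t.
    cond: conditioning input, e.g. visual features (assumption for illustration).
    """
    x_t, t, u = cfm_pair(x1, rng)
    v = predict_velocity(x_t, t, cond)
    return float(np.mean((v - u) ** 2))
```

In a real system `predict_velocity` would be a trained network (the abstract mentions a transformer) and `cond` would carry the video-derived features. Here any callable with a matching output shape works, which keeps the sketch independent of a particular deep learning framework.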

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Data Visualization and Analytics