ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Huadai Liu; Kaicheng Luo; Jialei Wang; Wen Wang; Qian Chen; Zhou Zhao; Wei Xue

arXiv:2506.21448 · eess.AS · November 6, 2025

Open Access · 1 Repo · 5 Models · 1 Dataset

TL;DR

ThinkSound introduces a chain-of-thought reasoning framework for interactive, stepwise audio generation and editing in videos, combining multimodal reasoning with a new dataset to improve fidelity and controllability.

Contribution

ThinkSound presents a CoT-based multimodal framework, together with a dataset, for structured reasoning in audio generation and editing from videos.

Findings

01 Achieves state-of-the-art video-to-audio generation performance.

02 Demonstrates effective interactive object-centric refinement.

03 Excels on out-of-distribution audio generation benchmarks.

Abstract

While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As in the work of professionals in the creative industries, such generation requires sophisticated reasoning about factors such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning…
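
The three-stage decomposition described in the abstract can be pictured as a simple staged loop. The sketch below is purely illustrative and is not taken from the ThinkSound codebase: every name (`AudioState`, `Stage`, `make_stage`, `run_pipeline`) is hypothetical, the reasoning step is a string template standing in for a multimodal LLM, and audio is represented by a log of text entries rather than a waveform. How the reasoning is consumed downstream is an assumption, since the abstract is truncated at that point.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AudioState:
    """Hypothetical stand-in for generated audio: a log of applied steps."""
    steps: List[str] = field(default_factory=list)


@dataclass
class Stage:
    name: str
    # Placeholder for the multimodal LLM that produces CoT reasoning.
    reason: Callable[[str], str]
    # Placeholder for whatever consumes the reasoning (assumed, not from the source).
    apply: Callable[[AudioState, str], AudioState]


def make_stage(name: str) -> Stage:
    """Build a toy stage whose 'reasoning' is a formatted string."""
    def reason(user_input: str) -> str:
        return f"[{name}] reasoning about: {user_input}"

    def apply(state: AudioState, cot: str) -> AudioState:
        # Return a new state so earlier stages remain inspectable.
        return AudioState(steps=state.steps + [cot])

    return Stage(name, reason, apply)


def run_pipeline(stages: List[Stage], inputs: List[str]) -> AudioState:
    """Run the stages in order, one user input per stage."""
    state = AudioState()
    for stage, user_input in zip(stages, inputs):
        cot = stage.reason(user_input)
        state = stage.apply(state, cot)
    return state


# Stage names follow the abstract; the inputs are made-up examples.
STAGES = [
    make_stage("foley generation"),
    make_stage("object-centric refinement"),
    make_stage("instruction-based editing"),
]

result = run_pipeline(
    STAGES,
    ["video of a street", "user selects the passing car", "add light rain"],
)
for step in result.steps:
    print(step)
```

Each stage receives the state left by the previous one, which mirrors the stepwise, interactive framing of the abstract without claiming anything about the actual models involved.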

Code & Models

Repositories

FunAudioLLM/ThinkSound
pytorch

Datasets

liuhuadai/AudioCoT
dataset · 55 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies