PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

Huadai Liu; Kaicheng Luo; Wen Wang; Qian Chen; Peiwen Sun; Rongjie Huang; Xiangang Li; Jieping Ye; Wei Xue

arXiv:2511.18833·cs.SD·March 4, 2026

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun, Rongjie Huang, Xiangang Li, Jieping Ye, Wei Xue

PDF

Open Access 1 Models

TL;DR

PrismAudio introduces a novel reinforcement learning framework with specialized chain-of-thought modules and multi-dimensional rewards for improved video-to-audio generation, addressing objective entanglement and enhancing perceptual quality.

Contribution

It is the first to integrate RL with decomposed reasoning modules and tailored rewards for multidimensional video-to-audio generation, improving interpretability and performance.

Findings

01

Achieves state-of-the-art results across perceptual dimensions

02

Introduces a new balanced benchmark for V2A tasks

03

Reduces training overhead with hybrid ODE-SDE sampling

Abstract

Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

🤗
FunAudioLLM/PrismAudio
model· ♡ 69
♡ 69

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenerative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing