AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen; Prem Seetharaman; Rithesh Kumar; Oriol Nieto; Shinji Watanabe; Justin Salamon; Zeyu Jin

arXiv:2602.17097·cs.SD·February 20, 2026

Open Access

TL;DR

AudioChat is a framework for unified generation, editing, and understanding of complex multi-source audio stories, built on a new training paradigm and evaluated with new task-specific metrics.

Contribution

The paper introduces AudioChat, which combines a novel Audio Transfusion Forcing training objective with training on dialogues simulated by LLM-based tool-calling agents, so that one model can generate, edit, and understand audio stories.

Findings

01. Effective multi-source audio story generation and editing.

02. New metrics for task-specific evaluation.

03. Demonstrated capabilities via a public demo.

Abstract

Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought…

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis