Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian; Haoran Wang; Bo-Hao Su; Chien-yu Huang; Qingzheng Wang; Jiatong Shi; William Chen; Xun Gong; Siddhant Arora; Chin-Jou Li; Masao Someki; Takashi Maekaku; Keita Goto; Yusuke Shinohara; Jin Sakuma; Chao-Han Huck Yang; Shinji Watanabe

arXiv:2602.05220 · cs.CL · February 24, 2026

Open Access

TL;DR

Bagpiper is an 8-billion-parameter audio foundation model that uses rich natural language captions to understand and generate diverse audio content, unifying audio understanding and synthesis without rigid, task-specific supervision.

Contribution

Introduces Bagpiper, an 8B audio foundation model that uses rich captions for holistic audio understanding and generation. The model is pre-trained on 600B tokens and fine-tuned with a caption-then-process workflow to solve diverse tasks.

Findings

01 Outperforms existing models on audio understanding benchmarks.

02 Generates high-quality speech, music, and sound effects.

03 Achieves unified audio understanding and synthesis.

Abstract

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks…
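
The caption-then-process workflow described in the abstract can be illustrated with a minimal sketch: first turn the audio into a rich caption, then solve the task from that caption plus the instruction. Everything here is hypothetical and is not the authors' implementation: the `RichCaption` fields (taken from the two examples the abstract gives, transcription and audio events), the `caption_then_process` function, and the stub captioner and processor, which stand in for model calls the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RichCaption:
    """Hypothetical container for a rich caption (fields are assumptions)."""
    transcription: str
    audio_events: List[str]

    def to_text(self) -> str:
        # Render the caption as plain natural language text.
        events = ", ".join(self.audio_events) if self.audio_events else "none"
        return f"Transcription: {self.transcription}\nAudio events: {events}"


def caption_then_process(
    audio: bytes,
    instruction: str,
    captioner: Callable[[bytes], RichCaption],
    processor: Callable[[str], str],
) -> str:
    """Two-step workflow: caption the audio, then process the task from the caption."""
    # Step 1: map the raw audio to a rich caption (the intermediate step).
    caption = captioner(audio)
    # Step 2: solve the task conditioned on the caption text and the instruction.
    prompt = f"{caption.to_text()}\nInstruction: {instruction}"
    return processor(prompt)


def stub_captioner(audio: bytes) -> RichCaption:
    # Stand-in for a model call; returns a fixed caption for illustration.
    return RichCaption(transcription="hello world", audio_events=["dog barking"])


def stub_processor(prompt: str) -> str:
    # Stand-in for a model call; simply echoes the prompt it was given.
    return f"[stub answer based on]\n{prompt}"


result = caption_then_process(
    b"", "List the sounds you hear.", stub_captioner, stub_processor
)
print(result)
```

Passing the captioner and processor as callables keeps the sketch independent of any particular model; in a single unified model both steps might be served by the same network, which is an assumption this sketch neither requires nor confirms.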

Taxonomy

Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications