Streaming Generation for Music Accompaniment

Yusong Wu; Mason Wang; Heidi Lei; Stephen Brade; Lancelot Blanchard; Shih-Lun Wu; Aaron Courville; Anna Huang

arXiv:2510.22105 · cs.SD · October 28, 2025

TL;DR

This paper introduces a real-time audio-to-audio music accompaniment model designed around system delays. It examines how latency, coherence, and throughput trade off against one another and argues that live performance scenarios call for anticipatory training objectives.

Contribution

The paper presents a model design that accounts for system delays, explores the trade-offs between future visibility and output chunk size, and highlights the need for anticipatory training objectives for coherent live accompaniment.

Findings

01. Increasing future visibility improves coherence but demands faster inference.

02. Larger output chunks increase throughput but reduce update frequency and quality.

03. Naive training methods are insufficient for real-time coherent accompaniment.

Abstract

Music generation models can produce high-fidelity, coherent accompaniment given complete audio input, but are limited to editing and loop-based workflows. We study real-time audio-to-audio accompaniment: as a model hears an input audio stream (e.g., a singer singing), it must simultaneously generate a coherent accompanying stream in real time (e.g., a guitar accompaniment). In this work, we propose a model design that accounts for the inevitable system delays of practical deployment, with two design variables: future visibility $t_{f}$, the offset between the output playback time and the latest input time used for conditioning, and output chunk duration $k$, the number of frames emitted per call. We train Transformer decoders across a grid of $(t_{f}, k)$ and show two consistent trade-offs: increasing effective $t_{f}$ improves coherence by reducing the recency gap, but requires faster inference to…
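The chunk-size side of this trade-off can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the class name `StreamBudget`, the 50 Hz frame rate, and the linear cost model (a fixed per-call overhead plus a per-frame cost) are all assumptions chosen for illustration. It only shows why emitting more frames per call can help a model keep up with playback while lowering how often the output is refreshed.

```python
from dataclasses import dataclass


@dataclass
class StreamBudget:
    """Hypothetical timing budget for a chunked streaming generator."""

    frame_rate_hz: float      # assumed audio-token frame rate (not from the paper)
    chunk_frames: int         # k: number of frames emitted per model call
    call_overhead_s: float    # assumed fixed cost of one model call
    per_frame_cost_s: float   # assumed extra cost per generated frame

    @property
    def chunk_duration_s(self) -> float:
        # Playback time covered by one chunk of k frames.
        return self.chunk_frames / self.frame_rate_hz

    @property
    def updates_per_second(self) -> float:
        # How often the model can react to new input.
        return self.frame_rate_hz / self.chunk_frames

    @property
    def inference_time_s(self) -> float:
        # Assumed linear cost model: overhead is amortized over the chunk.
        return self.call_overhead_s + self.per_frame_cost_s * self.chunk_frames

    @property
    def real_time_factor(self) -> float:
        # Values above 1 mean generation is slower than playback.
        return self.inference_time_s / self.chunk_duration_s

    def keeps_up(self) -> bool:
        return self.real_time_factor <= 1.0


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        b = StreamBudget(frame_rate_hz=50.0, chunk_frames=k,
                         call_overhead_s=0.03, per_frame_cost_s=0.01)
        print(f"k={k}: updates/s={b.updates_per_second:.2f}, "
              f"RTF={b.real_time_factor:.2f}, keeps_up={b.keeps_up()}")
```

Under these assumed numbers, small chunks fail to keep pace with playback because the per-call overhead dominates, while larger chunks keep pace at the cost of fewer updates per second. The real cost profile of any particular model would have to be measured.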
