Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
Tornike Karchkhadze, Shlomo Dubnov

TL;DR
This paper introduces a real-time AI music accompaniment system using latent diffusion models integrated with MAX/MSP, enabling live musical interactions with reduced latency and high coherence.
Contribution
It presents a novel framework combining diffusion-based AI models with real-time music tools, achieving low-latency accompaniment generation in a widely used environment.
Findings
Consistency distillation achieved a 5.4x reduction in sampling time.
Both models operate in real time with strong musical coherence.
Performance degrades gracefully with increased look-ahead, balancing latency and quality.
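The latency and look-ahead trade-off in these findings can be illustrated with a small timing sketch. It is one plausible reading of a sliding-window look-ahead schedule, not the authors' implementation: the names `Window`, `plan_window`, and `meets_deadline`, and all durations used here, are hypothetical. The idea is that at each step the model sees context up to the present and must deliver audio that starts `lookahead_s` seconds later, so inference plus transport time has to fit inside that look-ahead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Time spans in seconds for one generation step (hypothetical layout)."""
    context_start: float
    context_end: float
    target_start: float
    target_end: float


def plan_window(step: int, hop_s: float, context_s: float, lookahead_s: float) -> Window:
    """Place the context and target spans for a given step.

    Assumption: the window advances by hop_s per step, the context ends at
    the current time, and the target begins lookahead_s after the context ends.
    """
    context_end = step * hop_s
    context_start = max(0.0, context_end - context_s)
    target_start = context_end + lookahead_s
    return Window(context_start, context_end, target_start, target_start + hop_s)


def meets_deadline(inference_s: float, transport_s: float, lookahead_s: float) -> bool:
    """True if generated audio can arrive before it is due for playback."""
    return inference_s + transport_s <= lookahead_s


if __name__ == "__main__":
    print(plan_window(step=4, hop_s=1.0, context_s=6.0, lookahead_s=2.0))
    print(meets_deadline(inference_s=0.5, transport_s=0.05, lookahead_s=1.0))
```

Under these assumptions, a larger look-ahead relaxes the deadline but widens the gap between the last observed context and the audio to be predicted, which is consistent with quality degrading as look-ahead grows.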
Abstract
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end, which handles real-time audio input, buffering, and playback, with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our…
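To make the OSC/UDP link between the front-end and the inference server more concrete, here is a minimal sketch of how a chunk of float values could be packed into an OSC 1.0 message and sent over UDP using only the Python standard library. The address `/context`, the port, and the function names are hypothetical; the abstract does not specify the message layout the authors use.

```python
import socket
import struct


def _pad(data: bytes) -> bytes:
    """Null-terminate and pad to a multiple of 4 bytes, as OSC 1.0 strings require."""
    data += b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def encode_osc(address: str, floats: list) -> bytes:
    """Build an OSC message: address, type-tag string, big-endian float32 args."""
    type_tags = "," + "f" * len(floats)
    payload = struct.pack(f">{len(floats)}f", *floats)
    return _pad(address.encode("ascii")) + _pad(type_tags.encode("ascii")) + payload


def decode_osc(packet: bytes):
    """Parse a float-only OSC message back into (address, list of floats)."""
    def read_str(offset: int):
        end = packet.index(b"\x00", offset)
        text = packet[offset:end].decode("ascii")
        return text, (end + 4) & ~3  # skip terminator and padding

    address, offset = read_str(0)
    type_tags, offset = read_str(offset)
    count = type_tags.count("f")
    return address, list(struct.unpack_from(f">{count}f", packet, offset))


def send_chunk(host: str, port: int, samples: list) -> None:
    """Send one chunk as a single UDP datagram (hypothetical address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_osc("/context", samples), (host, port))


if __name__ == "__main__":
    packet = encode_osc("/context", [0.0, 0.5, -1.0])
    print(len(packet), decode_osc(packet))
```

UDP datagrams are size-limited and unordered, so a real system would likely split audio into small chunks and tag them with an index; that detail is omitted here.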
