The challenge of realistic music generation: modelling raw audio at scale
Sander Dieleman, Aäron van den Oord, Karen Simonyan

TL;DR
This paper addresses the challenge of generating realistic music by modeling raw audio directly, using autoregressive discrete autoencoders to capture long-range dependencies and produce stylistically consistent piano music.
Contribution
The paper explores autoregressive discrete autoencoders (ADAs) for music generation in the raw audio domain, as a means of enabling autoregressive models to capture long-range correlations in waveforms and generate stylistically consistent music.
Findings
When applied to music, autoregressive models of raw audio are biased towards local signal structure at the expense of long-range correlations.
Autoregressive discrete autoencoders improve long-range correlation modeling.
Generated piano music maintains stylistic consistency over tens of seconds.
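For context, "autoregressive" here means that the model factorises the joint distribution over a waveform into one conditional distribution per audio sample. A sketch in standard notation, where the symbols $x_t$ (the audio sample at timestep $t$) and $T$ (the total number of samples) are assumed for illustration and are not taken from this summary:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Capturing long-range correlations requires each conditional to depend on samples far in the past. At a sampling rate of, say, 16 kHz (an illustrative figure), ten seconds of audio already spans 160,000 timesteps.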
Abstract
Realistic music generation is a challenging task. When building generative models of music that are learnt from data, typically high-level representations such as scores or MIDI are used that abstract away the idiosyncrasies of a particular performance. But these nuances are very important for our perception of musicality and realism, so in this work we embark on modelling music in the raw audio domain. It has been shown that autoregressive models excel at generating raw audio waveforms of speech, but when applied to music, we find them biased towards capturing local signal structure at the expense of modelling long-range correlations. This is problematic because music exhibits structure at many different timescales. In this work, we explore autoregressive discrete autoencoders (ADAs) as a means to enable autoregressive models to capture long-range correlations in waveforms. We find…
Videos
DeepMind's AI Learns The Piano From The Masters of The Past · YouTube
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
