Stack-and-Delay: a new codebook pattern for music generation
Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

TL;DR
This paper introduces a stack-and-delay decoding strategy for hierarchical token-based music generation. It decodes four times faster than the flat pattern, bringing inference time close to that of the delay pattern, while almost matching flat pattern quality in objective evaluations.
Contribution
The paper proposes a stack-and-delay codebook pattern that decodes four times faster than vanilla flat decoding and, for the same inference efficiency budget as the delay pattern, performs better in objective evaluations.
Findings
Decoding is four times faster than with the flat pattern.
The new method nearly matches flat pattern quality in objective evaluations.
Subjective tests favor samples from the new model over competing approaches.
Abstract
In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay style of decoding strategy to improve upon the flat pattern decoding where generation speed is four times faster as opposed to vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy, and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the gap with the flat pattern in terms of quality. The…
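To make the speed trade-off between codebook patterns concrete, the sketch below counts autoregressive steps for the two baseline patterns named in the abstract, following their commonly used definitions: the flat pattern emits one token per step, while the delay pattern shifts codebook k by k steps and predicts the aligned tokens in parallel. It does not implement stack-and-delay, whose exact layout is not given in this summary. The function names and the values of `T` and `K` are hypothetical, chosen only for illustration.

```python
def flat_pattern(num_frames, num_codebooks):
    """Flat pattern: one (frame, codebook) token per decoding step."""
    return [[(t, k)] for t in range(num_frames) for k in range(num_codebooks)]


def delay_pattern(num_frames, num_codebooks):
    """Delay pattern: codebook k lags by k steps; tokens sharing a step
    are predicted in parallel."""
    steps = []
    for s in range(num_frames + num_codebooks - 1):
        step = [(s - k, k) for k in range(num_codebooks)
                if 0 <= s - k < num_frames]
        steps.append(step)
    return steps


# Hypothetical sizes (assumption): T frames, K codebooks per frame.
T, K = 6, 4
flat_steps = len(flat_pattern(T, K))    # T * K steps
delay_steps = len(delay_pattern(T, K))  # T + K - 1 steps
print(f"flat: {flat_steps} steps, delay: {delay_steps} steps")
```

Under these definitions the flat pattern needs K times as many steps as there are frames, whereas the delay pattern needs only K - 1 extra steps. That gap is the one the abstract describes the proposed method as narrowing while retaining most of the flat pattern's quality.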
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
