TL;DR
HAFM is a hierarchical autoregressive model that generates instrumental music to accompany input vocals, using dual-rate codec tokenization and a three-stage Transformer architecture to produce time-aligned audio.
Contribution
The paper introduces a dual-rate tokenization scheme and a three-stage hierarchical architecture for improved music accompaniment generation.
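To make the dual-rate idea concrete, here is a minimal sketch of the frame arithmetic implied by the two token rates given in the abstract (50 Hz semantic, 75 Hz acoustic). The function names `token_counts` and `acoustic_span_for_semantic` are hypothetical, and the paper's actual alignment procedure is not described on this page; this only shows how indices at the two rates map to the same time span.

```python
import math
from fractions import Fraction

SEMANTIC_HZ = 50  # HuBERT semantic token rate for vocals (from the abstract)
ACOUSTIC_HZ = 75  # EnCodec acoustic token rate for instrumentals (from the abstract)


def token_counts(duration_s):
    """Number of whole semantic and acoustic frames in a clip of this length."""
    return math.floor(duration_s * SEMANTIC_HZ), math.floor(duration_s * ACOUSTIC_HZ)


def acoustic_span_for_semantic(i):
    """Half-open range [first, last) of acoustic frames overlapping semantic frame i.

    Exact fractions avoid floating-point rounding at frame boundaries.
    """
    start = Fraction(i, SEMANTIC_HZ)      # start time of semantic frame i, in seconds
    end = Fraction(i + 1, SEMANTIC_HZ)    # end time of semantic frame i, in seconds
    first = math.floor(start * ACOUSTIC_HZ)
    last = math.ceil(end * ACOUSTIC_HZ)
    return first, last


print(token_counts(10))                # (500, 750)
print(acoustic_span_for_semantic(0))   # (0, 2)
print(acoustic_span_for_semantic(1))   # (1, 3)
```

Because 75/50 = 3/2, every two semantic frames cover exactly three acoustic frames, so the two streams stay time-aligned without having to share a frame rate.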
Findings
HAFM achieves a Fréchet Audio Distance of 2.08 on MUSDB18.
It outperforms retrieval baselines in quality.
It matches state-of-the-art systems with fewer parameters.
Abstract
We present HAFM, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization. Experiments on MUSDB18 demonstrate that…
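The abstract mentions classifier-free guidance without giving details. As a rough illustration, the commonly used form of the technique combines conditional and unconditional next-token logits before sampling. The sketch below assumes that standard formulation; the function name `apply_cfg`, the guidance scale value, and the point at which HAFM applies guidance are assumptions, not details taken from the paper.

```python
import math


def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Standard classifier-free guidance on logits (assumed form, not HAFM-specific).

    guidance_scale = 0 returns the unconditional logits,
    guidance_scale = 1 returns the conditional logits,
    guidance_scale > 1 pushes further toward the conditioning signal.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token vocabulary (illustrative values only).
cond = [2.0, 0.5, -1.0]    # model conditioned on the vocal tokens
uncond = [1.0, 1.0, 0.0]   # model with the conditioning dropped

guided = apply_cfg(cond, uncond, guidance_scale=2.0)
print(guided)            # [3.0, 0.0, -2.0]
print(softmax(guided))
```

In this form, each decoding step needs two forward passes (with and without the vocal conditioning), which is the usual cost of applying guidance to an autoregressive model.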
