Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

TL;DR
This paper proposes a novel neural audio codec combined with a transformer-based model to improve bandwidth extension by disentangling harmonic and percussive components, leading to high-quality audio reconstruction.
Contribution
It introduces a new codec design that explicitly guides disentanglement for better token prediction in bandwidth extension tasks.
Findings
High-quality audio reconstruction demonstrated by objective metrics.
Effective coupling of codec structure with transformer modeling.
Disentanglement based on harmonic-percussive decomposition enhances spectral relevance.
Abstract
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks. In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream…
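To make the harmonic-percussive idea concrete, here is a minimal sketch of one classic way to separate the two components: median filtering a magnitude spectrogram along time (which favors sustained, horizontal structures) and along frequency (which favors transient, vertical structures). This summary does not say which decomposition the authors use, so the median-filtering method, the function names `median_filter_1d` and `hpss_masks`, the filter width, and the toy spectrogram are all assumptions made for illustration.

```python
from statistics import median


def median_filter_1d(values, width):
    """Sliding median of odd `width`, replicating the edge values as padding."""
    half = width // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(values))]


def hpss_masks(mag, width=5, eps=1e-10):
    """Return soft (harmonic, percussive) masks for mag[freq][time].

    Hypothetical helper: illustrates median-filter separation only,
    not the decomposition used in the paper.
    """
    n_freq, n_time = len(mag), len(mag[0])
    # Harmonic estimate: smooth each frequency bin along time.
    harm = [median_filter_1d(row, width) for row in mag]
    # Percussive estimate: smooth each time frame along frequency.
    frames = [median_filter_1d([mag[f][t] for f in range(n_freq)], width)
              for t in range(n_time)]
    perc = [[frames[t][f] for t in range(n_time)] for f in range(n_freq)]
    mask_h = [[0.0] * n_time for _ in range(n_freq)]
    mask_p = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            h2, p2 = harm[f][t] ** 2, perc[f][t] ** 2
            denom = h2 + p2 + eps  # eps avoids division by zero in silent bins
            mask_h[f][t] = h2 / denom
            mask_p[f][t] = p2 / denom
    return mask_h, mask_p


# Toy spectrogram: a sustained tone (row 3) plus a click (column 3).
toy = [[1.0 if f == 3 or t == 3 else 0.0 for t in range(7)] for f in range(7)]
mask_h, mask_p = hpss_masks(toy)
```

On this toy input the sustained tone is assigned almost entirely to the harmonic mask, the click to the percussive mask, and the bin where they cross is split evenly between the two. A real pipeline would apply such masks to an STFT of the audio before resynthesis.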
Peer Reviews
Decision · Submitted to ICLR 2026
- The model can do 16 kHz to 48 kHz BWE
## Missing baseline for the codec part
In Sec. 3.1.1, the authors explicitly mention that the architecture is inspired by DAC. Given this relationship, it is essential to evaluate the performance of a standard DAC model in Table 1. Unfortunately, DAC is not included. As a reader, I would be interested in benchmarks of more standard codecs, including DAC, EnCodec, or the more recent MuCodec and SpectroStream. All these models offer pretrained 48 kHz or 44.1 kHz weights. Without the comparison with a
* Conceptually aligned inductive bias. Splitting harmonic/percussive/residual structure matches known spectral decompositions in music/audio and is a reasonable prior for bandwidth extension.
* The model decomposition (two-branch codec -> semantic RVQ -> LM) is straightforward and can be done with standard components.
* Ablations on harmonic and percussive inputs show that the intended sections contribute as designed.
* The proposed method outperforms strong baselines on many objective metrics
* The system is essentially a composition of existing components (RVQ codec + token LM). The semantic split is a hand-crafted heuristic, not a new modeling paradigm, and the architecture is also taken from existing work. Novelty overall is therefore limited.
* One of the core claims regarding the necessity of the semantic split is not demonstrated. The ablation removes the semantic split but replaces three deeper transformers with a single-layer LM, and only makes very vague claims as to why ("
The idea of structuring the latent space of a neural audio codec to explicitly align with a downstream task is a very strong and promising direction. The authors' design, which introduces a harmonic-percussive decomposition as an inductive bias, is well-motivated by principles of audio signal processing and aims to create a more predictable and interpretable representation for the language model. The empirical results presented are a clear strength of this work. The proposed HP-codecX consistent
While the results are impressive, I find the paper's fundamental premise to be insufficiently justified, which constitutes a significant weakness. The entire approach is built on using a neural audio codec, which is an inherently lossy compression process. The paper does not adequately explain why introducing this information bottleneck is a desirable step for a high-fidelity restoration task like bandwidth extension. It seems counter-intuitive to first discard information through compression, o
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
