Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

Junhyeok Lee; Xiluo He; Jihwan Lee; Helin Wang; Shrikanth Narayanan; Thomas Thebaud; Laureano Moro-Velazquez; Jesús Villalba; Najim Dehak

arXiv:2603.05887 · eess.AS · March 9, 2026

Open Access

TL;DR

This paper introduces a self-supervised representation reconstruction loss that improves neural audio codecs by enhancing intelligibility, accelerating training, and enabling low-latency streaming without lookahead, achieving state-of-the-art results.

Contribution

The paper proposes the self-supervised representation reconstruction (SSRR) loss, a training objective that improves speech intelligibility and training efficiency in streaming neural audio codecs.

Findings

01  SSRR accelerates convergence, yielding competitive results with training on a single GPU.

02  SSRR improves the intelligibility of reconstructed speech without requiring lookahead.

03  The proposed codec, JHCodec, achieves state-of-the-art performance with minimal latency.

Abstract

Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced…
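
The core idea in the abstract, applying the loss to representations computed from the codec's output rather than to the encoder's latent, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `frame_features` is a hypothetical stand-in for a frozen self-supervised speech encoder, the L1 distance is an arbitrary choice, and the function names are not taken from the paper. It is not the authors' implementation.

```python
import math


def frame_features(wave, frame=4):
    """Hypothetical stand-in for a frozen self-supervised encoder.

    Returns one [mean, RMS energy] vector per non-overlapping frame.
    A real system would use a pretrained speech model here (assumption).
    """
    feats = []
    for start in range(0, len(wave) - frame + 1, frame):
        chunk = wave[start:start + frame]
        mean = sum(chunk) / frame
        rms = math.sqrt(sum(x * x for x in chunk) / frame)
        feats.append([mean, rms])
    return feats


def ssrr_style_loss(reference, decoded, extractor=frame_features):
    """Mean L1 distance between features of the reference audio and
    features of the codec output (distance choice is an assumption)."""
    ref_feats = extractor(reference)
    out_feats = extractor(decoded)
    if not ref_feats or len(ref_feats) != len(out_feats):
        raise ValueError("inputs must yield the same, non-zero number of frames")
    total, count = 0.0, 0
    for ref_vec, out_vec in zip(ref_feats, out_feats):
        for a, b in zip(ref_vec, out_vec):
            total += abs(a - b)
            count += 1
    return total / count


# A perfect reconstruction gives zero loss; a silent output does not.
reference = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
print(ssrr_style_loss(reference, reference))   # 0.0
print(ssrr_style_loss(reference, [0.0] * 8))
```

The point of the sketch is where the loss is attached: the extractor sees the decoded waveform, so the penalty reflects what survives in the reconstructed speech, which is the property the abstract contrasts with semantic encoder distillation.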

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis