Do Neural Codecs Generalize? A Controlled Study Across Unseen Languages and Non-Speech Tasks

Shih-Heng Wang; Jiatong Shi; Jinchuan Tian; Haibin Wu; Shinji Watanabe

arXiv:2601.12205 · cs.SD · January 21, 2026

Open Access

TL;DR

This study systematically evaluates how well neural audio codecs (NACs) generalize to languages unseen during pre-training and to non-speech tasks. It finds that adding non-speech data to pre-training improves non-speech performance without sacrificing speech performance.

Contribution

The paper provides a controlled, fair comparison of NACs trained from scratch, showing how the choice of pre-training data affects generalization to new languages and non-speech applications.

Findings

01. NACs can generalize to languages unseen during pre-training.

02. Speech-only pre-trained NACs perform poorly on non-speech tasks.

03. Including non-speech data in pre-training improves non-speech task performance while maintaining speech quality.

Abstract

This paper investigates three crucial yet underexplored aspects of the generalization capabilities of neural audio codecs (NACs): (i) whether NACs can generalize to unseen languages during pre-training, (ii) whether speech-only pre-trained NACs can effectively generalize to non-speech applications such as environmental sounds, music, and animal vocalizations, and (iii) whether incorporating non-speech data during pre-training can improve performance on both speech and non-speech tasks. Existing studies typically rely on off-the-shelf NACs for comparison, which limits insight due to variations in implementation. In this work, we train NACs from scratch using strictly controlled configurations and carefully curated pre-training data to enable fair comparisons. We conduct a comprehensive evaluation of NAC performance on both signal reconstruction quality and downstream applications using…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research