Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling

Husein Zolkepli

arXiv:2601.20185 · cs.CL · March 10, 2026

TL;DR

This paper modifies X-Codec-2.0 to halve the latent rate from 50 Hz to 25 Hz and raise the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality for multilingual speech compression.

Contribution

The paper introduces a simple modification to X-Codec-2.0, additional pooling combined with a larger decoder hop size, that improves efficiency and audio fidelity without changing the core architecture.
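As a rough illustration of how pooling can halve a latent rate, the sketch below averages each pair of consecutive feature frames, turning 50 frames per second into 25. The source says only that "additional pooling" is introduced; the choice of average pooling, the factor of 2 as a single pooling step, and the names `avg_pool_frames`, `frames_50hz`, and `frames_25hz` are assumptions made for this example, not the paper's implementation.

```python
from typing import List


def avg_pool_frames(frames: List[List[float]], factor: int = 2) -> List[List[float]]:
    """Average every `factor` consecutive frames along the time axis.

    Hypothetical helper: average pooling is an assumption, the paper
    only states that additional pooling is used. Trailing frames that
    do not fill a complete group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        # Element-wise mean over the frames in this group.
        pooled.append([sum(frame[d] for frame in group) / factor for d in range(dim)])
    return pooled


# One second of made-up 50 Hz features, 4 dimensions per frame.
frames_50hz = [[float(t)] * 4 for t in range(50)]

# Pooling by 2 leaves 25 frames for the same second of audio.
frames_25hz = avg_pool_frames(frames_50hz, factor=2)
```

In a real model this step would more likely be a learned or framework-provided pooling layer over tensors; the list-based version is only meant to make the rate change visible.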

Findings

01

Achieved a 0.29 MOS improvement (UTMOSv2) over the original X-Codec-2.0 baseline on the multilingual Common Voice 17 test set.

02

Attained the best reported performance among codecs operating at a 25 Hz latent rate.

03

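Improved efficiency and perceptual quality by halving the latent rate to 25 Hz while raising the output sampling rate to 24 kHz, without altering the core architecture.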

Abstract

X-Codec-2.0 has shown strong performance in neural audio compression and multilingual speech modeling, operating at a 50 Hz latent rate and a 16 kHz sampling rate using frozen HuBERT features. While effective, this configuration limits temporal efficiency and audio fidelity. In this work, we explore a simple and effective modification by introducing additional pooling and increasing the decoder hop size. This reduces the latent rate from 50 Hz to 25 Hz and simultaneously raises the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality without altering the core architecture. Evaluated on the multilingual Common Voice 17 test set, the proposed configuration achieves a 0.29 MOS improvement over the original X-Codec-2.0 baseline based on UTMOSv2, and attains the best reported performance among all codecs operating at 25 Hz. The source code, checkpoints, and…
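The two rates quoted in the abstract fix how many output samples each latent frame has to cover, which is why a lower latent rate and a higher sampling rate together imply a larger decoder hop. The sketch below works through that arithmetic. The function names are hypothetical, and the per-frame figures are derived only from the stated rates; how the actual decoder factors its hop size is not described on this page.

```python
def samples_per_frame(sample_rate_hz: int, latent_rate_hz: int) -> int:
    """Output samples that one latent frame must cover (hypothetical helper)."""
    if sample_rate_hz % latent_rate_hz != 0:
        raise ValueError("sample rate must be a multiple of the latent rate")
    return sample_rate_hz // latent_rate_hz


def frames_for_duration(seconds: int, latent_rate_hz: int) -> int:
    """Number of latent frames needed for a clip of the given length."""
    return seconds * latent_rate_hz


# Baseline X-Codec-2.0: 16 kHz audio at a 50 Hz latent rate.
baseline_hop = samples_per_frame(16_000, 50)   # 320 samples per frame

# Proposed configuration: 24 kHz audio at a 25 Hz latent rate.
proposed_hop = samples_per_frame(24_000, 25)   # 960 samples per frame

# Sequence length for a 10-second clip under each configuration.
baseline_frames = frames_for_duration(10, 50)  # 500 frames
proposed_frames = frames_for_duration(10, 25)  # 250 frames
```

Each latent frame in the proposed configuration therefore spans three times as many output samples as in the baseline, while a downstream sequence model sees half as many frames for the same clip.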

Code & Models

Models

Scicom-intl/xcodec2-25TPS-24k (Hugging Face model)

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Speech and Audio Processing