MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Yitian Gong; Kuangwei Chen; Zhaoye Fei; Xiaogui Yang; Ke Chen; Yang Wang; Kexin Huang; Mingshu Chen; Ruixiao Li; Qingyuan Cheng; Shimin Li; Xipeng Qiu

arXiv:2602.10934 · cs.SD · February 13, 2026

TL;DR

This paper introduces MOSS-Audio-Tokenizer, a fully end-to-end, Transformer-based audio tokenizer trained on large-scale data, enabling high-fidelity audio reconstruction and advancing audio foundation models.

Contribution

The paper proposes CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture for scalable, homogeneous audio tokenization, and demonstrates its effectiveness across multiple audio domains and tasks.

Findings

01 Outperforms prior codecs across diverse audio types and bitrates.

02 Supports high-fidelity reconstruction, with quality improving as scale increases.

03 Enables the first autoregressive TTS model to surpass previous systems.

Abstract

Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained…
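To make the encoder, quantizer, decoder pipeline more concrete, here is a minimal sketch of what a quantizer step can look like in general: each continuous latent frame is mapped to the index of its nearest codebook entry (the discrete token), and decoding looks the entry back up. This is generic nearest-neighbor vector quantization, not the method used in the paper, whose quantizer design is not described in the text above. The function names, the toy codebook, and the frame values are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each discrete token."""
    return [codebook[t] for t in tokens]


# Hypothetical toy values: a 3-entry codebook over 2-dimensional latents.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
frames = [(0.9, 1.2), (-0.8, 0.4), (0.1, -0.1)]

tokens = quantize(frames, codebook)            # discrete token IDs, one per frame
reconstructed = dequantize(tokens, codebook)   # quantized latents a decoder would consume
print(tokens)
```

In an end-to-end tokenizer, the frames would come from a learned encoder and the reconstructed latents would feed a learned decoder, with the codebook trained jointly; the sketch only shows the discrete lookup in between.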

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing