OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement

Jingbin Hu; Haoyu Zhang; Dake Guo; Qirui Zhan; Wenhao Li; Huakang Chen; Guobin Ma; Hanke Xie; Chengyou Wang; Pengyuan Xie; Chuan Xie; Qiang Zhang; Lei Xie

arXiv:2603.20638 · eess.AS · March 24, 2026

Open Access

TL;DR

OmniCodec is a universal, low frame rate neural audio codec that disentangles semantic and acoustic information. At the same bitrate as the Mimi codec, it reconstructs audio more faithfully and yields more semantically informative representations across speech, music, and general sound.

Contribution

OmniCodec introduces a hierarchical multi-codebook design with semantic-acoustic decoupling, together with a self-guidance strategy that improves codebook utilization and reconstruction, enabling low frame rate modeling of speech, music, and general sound.
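
As a rough illustration of why frame rate and codebook count trade off against each other, the sketch below computes the nominal bitrate of a multi-codebook codec that emits one token per codebook per frame. The function name `bitrate_bps` and all numeric values are hypothetical and chosen only for illustration; they are not OmniCodec's configuration and are not taken from the paper.

```python
import math

def bitrate_bps(frame_rate_hz, codebook_sizes):
    """Nominal bitrate in bits per second.

    Assumes one token per codebook per frame, each token costing
    log2(codebook size) bits (no entropy coding).
    """
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame

# Hypothetical configurations, for illustration only:
# a low frame rate codec with many codebooks...
low_rate = bitrate_bps(12.5, [2048] * 8)
# ...and a higher frame rate codec with fewer codebooks.
high_rate = bitrate_bps(50.0, [2048] * 2)

print(f"low frame rate:  {low_rate:.0f} bps")
print(f"high frame rate: {high_rate:.0f} bps")
```

Both hypothetical configurations land on the same nominal bitrate, which is why comparisons between codecs are usually made at matched bitrate rather than matched frame rate: a lower frame rate shortens the token sequence a downstream language model has to handle, at the cost of packing more information into each frame.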

Findings

01 Outperforms Mimi codec at the same bitrate
02 Provides more semantically informative representations
03 Enhances downstream audio generation tasks

Abstract

Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, which limits effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rate modeling. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling, leveraging the audio encoder of a pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech Recognition and Synthesis · Speech and Audio Processing