QuarkAudio Technical Report

Chengwei Liu; Haoyin Yan; Shaofei Xue; Xiaotao Liang; Xiaofu Chen; Bin Gong; Zheng Xue; Gang Song

arXiv:2512.20151·eess.AS·December 24, 2025

TL;DR

QuarkAudio introduces a unified autoregressive framework with a novel audio tokenizer, enabling multiple audio processing and generation tasks, including speech restoration, voice conversion, and natural language-guided audio editing, with high efficiency and quality.

Contribution

The report presents QuarkAudio, a versatile decoder-only, language-model-based framework with a new high-fidelity audio tokenizer that unifies diverse audio tasks in a single system.

Findings

01. High-quality audio reconstruction with low frame rate.

02. Competitive performance across multiple audio tasks.

03. Effective natural language-guided audio editing.
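
The low frame rate matters because an autoregressive model generates one token per step, so the sequence length grows linearly with the tokenizer's frame rate. The sketch below shows that arithmetic under a fixed frame rate. The function name `token_count`, the frame rates, and the codebook counts are illustrative assumptions, not figures or code from the report.

```python
import math


def token_count(duration_s: float, frame_rate_hz: float, num_codebooks: int = 1) -> int:
    """Tokens needed to represent a clip, assuming a fixed frame rate.

    All arguments are hypothetical illustration values; the report's actual
    frame rate and codebook layout are not stated in this summary.
    """
    if duration_s < 0 or frame_rate_hz <= 0 or num_codebooks < 1:
        raise ValueError("duration must be >= 0, frame rate > 0, codebooks >= 1")
    # One frame per 1/frame_rate_hz seconds; round up for a partial final frame.
    frames = math.ceil(duration_s * frame_rate_hz)
    # Each frame contributes one token per codebook.
    return frames * num_codebooks
```

With these assumed values, a 10-second clip needs 500 tokens at 50 Hz but only 125 tokens at 12.5 Hz, which shortens the sequence the language model has to generate by a factor of four.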

Abstract

Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high-quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing