Covo-Audio Technical Report

Wenfu Wang; Chenxing Li; Liqiang Zhang; Yiyang Zhao; Yuxiang Zou; Hanzhao Li; Mingyu Cui; Hao Zhang; Kun Wei; Le Xu; Zikang Huang; Jiajun Xu; Jiliang Hu; Xiang He; Zeyu Xie; Jiawen Kang; Youjun Chen; Meng Yu; Dong Yu; Rilin Chen; Linlin Di; Shulin Feng; Na Hu; Yang Liu; Bang Wang; Shan Yang

arXiv:2602.09823 · cs.SD · March 17, 2026

Open Access · 2 Models

TL;DR

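Covo-Audio is a 7B-parameter end-to-end large audio language model (LALM) that takes continuous audio as input and generates audio as output. It achieves state-of-the-art or competitive performance among models of comparable scale in speech and audio understanding, spoken dialogue, and full-duplex interaction, and it supports voice customization through a decoupled intelligence-speaker strategy.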

Contribution

This work introduces Covo-Audio, a single unified end-to-end architecture that both processes audio input and generates audio output, and proposes an intelligence-speaker decoupling method for flexible voice customization.

Findings

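01. Achieves state-of-the-art or competitive performance among models of comparable scale across speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction.

02. The pretrained foundation model shows strong speech-text comprehension and semantic reasoning, and the dialogue-oriented variant, Covo-Audio-Chat, shows strong spoken conversational abilities.

03. Enables flexible voice customization with minimal TTS data.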

Abstract

In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing