Fun-Audio-Chat Technical Report

Tongyi Fun Team; Qian Chen; Luyao Cheng; Chong Deng; Xiangang Li; Jiaqing Liu; Chao-Hong Tan; Wen Wang; Junhao Xu; Jieping Ye; Qinglin Zhang; Qiquan Zhang; Jingren Zhou

arXiv:2512.20156·cs.CL·January 21, 2026

TL;DR

Fun-Audio-Chat is a novel large audio language model that balances efficiency and quality through dual-resolution speech representations and mitigates catastrophic forgetting via core-cocktail training, enabling robust audio understanding and interaction.

Contribution

The paper introduces Fun-Audio-Chat, combining dual-resolution speech representations and a new training method to improve audio-text modeling without large-scale pre-training.

Findings

01 Achieves top performance on Spoken QA benchmarks.

02 Demonstrates competitive results on Speech-to-Text and Speech-to-Speech tasks.

03 Enables full-duplex voice interactions with strong performance.

Abstract

Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO…
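
The 25Hz-to-5Hz token grouping mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that grouping maps 25Hz speech tokens to a 5Hz sequence for the Shared LLM. The function names `group_tokens` and `ungroup_tokens`, the fixed group size of 5 (25 / 5), and the zero `pad_id` used to fill the last group are all assumptions made for illustration. How each group is turned into a single LLM input (for example by concatenating or pooling embeddings) is not specified in the abstract and is left out here.

```python
from typing import List

SPEECH_TOKEN_RATE_HZ = 25  # speech token rate stated in the abstract
LLM_FRAME_RATE_HZ = 5      # Shared LLM rate stated in the abstract
GROUP_SIZE = SPEECH_TOKEN_RATE_HZ // LLM_FRAME_RATE_HZ  # 5 tokens per group


def group_tokens(tokens: List[int],
                 group_size: int = GROUP_SIZE,
                 pad_id: int = 0) -> List[List[int]]:
    """Split a token sequence into consecutive groups of `group_size`.

    The final group is right-padded with `pad_id` (a hypothetical padding
    token) so that every group has the same length.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    missing = (-len(tokens)) % group_size
    padded = tokens + [pad_id] * missing
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def ungroup_tokens(groups: List[List[int]], original_length: int) -> List[int]:
    """Flatten groups back to a token sequence and drop the padding."""
    flat = [token for group in groups for token in group]
    return flat[:original_length]


# Two seconds of speech at 25Hz gives 50 tokens, i.e. 10 positions at 5Hz.
tokens = list(range(1, 51))
groups = group_tokens(tokens)
print(len(tokens), "tokens ->", len(groups), "grouped positions")
```

Under these assumptions the LLM sees one position per group, so its sequence length drops by a factor of 5. The ~50% GPU figure in the abstract is the paper's own reported number and is not derived from this sketch.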

Code & Models

Models

FunAudioLLM/Fun-Audio-Chat-8B (Hugging Face model · 3.7k downloads · 182 likes)

Datasets

FunAudioLLM/SpeechFCEval (dataset · 457 downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research