DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

Jiaming Zhou; Xuxin Cheng; Shiwan Zhao; Yuhang Jia; Cao Liu; Ke Zeng; Xunliang Cai; Yong Qin

arXiv:2601.23161·cs.SD·February 2, 2026

TL;DR

DIFFA-2 is a practical diffusion-based large audio language model for general audio understanding. It upgrades the speech encoder, adds dual semantic and acoustic adapters, and trains with a four-stage curriculum; it is reported to improve performance and efficiency over previous models and comes with an open-source implementation.

Contribution

The paper presents DIFFA-2, a diffusion-based large audio language model (LALM) that improves audio understanding through a new training pipeline and architectural upgrades, with the aim of making diffusion models viable for large-scale audio tasks.

Findings

01. DIFFA-2 outperforms DIFFA on multiple benchmarks.

02. DIFFA-2 is competitive with autoregressive models under practical budgets.

03. The model is trained solely on open-source data.

Abstract

Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that…
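To make the contrast with strictly sequential decoding concrete, here is a minimal toy sketch of confidence-based parallel unmasking, a decoding style commonly used with masked diffusion language models. It is not DIFFA-2's decoding scheme, which the abstract does not specify. The names `diffusion_decode`, `toy_predict`, and `TARGET` are hypothetical, and the predictor is a hard-coded stand-in for a real network.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet

# Hypothetical predictor type: given the partially filled sequence, return a
# (token, confidence) guess for every position.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def diffusion_decode(
    predict: Predictor, length: int, steps: int
) -> Tuple[List[Optional[str]], int]:
    """Fill `length` masked slots in at most `steps` parallel passes."""
    seq: List[Optional[str]] = [MASK] * length
    calls = 0  # number of predictor (model) calls made
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        guesses = predict(seq)
        calls += 1
        # Spread the remaining masked slots evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Commit the k most confident guesses; the rest stay masked.
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = guesses[i][0]
    return seq, calls


# Toy stand-in for a trained model (an assumption for illustration only):
# it always guesses the target token and is more confident at early positions.
TARGET = "a dog barks over light rain and traffic".split()


def toy_predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    return [(TARGET[i], 1.0 / (i + 1)) for i in range(len(seq))]


tokens, calls = diffusion_decode(toy_predict, len(TARGET), steps=4)
print(" ".join(tokens), "| model calls:", calls)
```

In this toy setting the eight tokens are committed two at a time, so the predictor runs four times, whereas token-by-token autoregressive decoding would need one call per token. Real speed and quality trade-offs depend on the model and the unmasking schedule.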

Code & Models

Models

zhoujiaming777/DIFFA-2 (Hugging Face model; 40 downloads, 1 like)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing