DIFFA: Large Language Diffusion Models Can Listen and Understand

Jiaming Zhou; Hongjie Chen; Shiwan Zhao; Jian Kang; Jie Li; Enzhi Wang; Yujie Guo; Haoqin Sun; Hui Wang; Aobo Kong; Yong Qin; Xuelong Li

arXiv:2507.18452 · cs.SD · November 11, 2025

TL;DR

DIFFA introduces a diffusion-based large audio-language model that effectively understands spoken language, demonstrating competitive performance with limited training data and opening new avenues for speech-driven AI.

Contribution

This work is the first to apply diffusion models to large-scale audio-language understanding, integrating a frozen diffusion model with a dual-adapter architecture and a two-stage training pipeline.

Findings

01 DIFFA is competitive with, and in several cases outperforms, autoregressive baselines on major benchmarks.

02 It reaches this performance with only 960 hours of ASR data and 127 hours of synthetic instruction data.

03 The results point to the potential of diffusion models for scalable audio understanding.

Abstract

Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs…
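
The two-stage pipeline described in the abstract can be sketched as a small training plan. This is an illustrative outline only, not the authors' code: the component names (`diffusion_lm`, `adapter_a`, `adapter_b`), the choice of which adapter is trained in which stage, and the pairing of the 960-hour and 127-hour datasets with stages 1 and 2 are assumptions made for the example. The only properties taken from the text are that the diffusion language model stays frozen, that there are two adapters, and that an ASR stage precedes an instruction-following stage.

```python
from dataclasses import dataclass

# Hypothetical component names; the abstract only mentions a frozen
# diffusion language model and a dual-adapter architecture.
COMPONENTS = ("diffusion_lm", "adapter_a", "adapter_b")


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str
    data: str
    trainable: frozenset  # components whose weights are updated in this stage


def build_pipeline():
    """Return the two training stages in order (assumed split, see lead-in)."""
    return [
        # Stage 1: align semantic representations via an ASR objective.
        Stage("stage1", "ASR", "960 h ASR data",
              frozenset({"adapter_a"})),
        # Stage 2: learn instruction following from synthetic data.
        Stage("stage2", "instruction following",
              "127 h synthetic instruction data",
              frozenset({"adapter_a", "adapter_b"})),
    ]


def frozen_components(stage):
    """List the components that receive no gradient updates in a stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


if __name__ == "__main__":
    for stage in build_pipeline():
        print(stage.name, stage.objective, "frozen:", frozen_components(stage))
```

In a real training script, each stage's `trainable` set would decide which parameter groups are handed to the optimizer; the backbone never appears in either set, which is what "frozen" means here.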

Code & Models

Models

zhoujiaming777/DIFFA (Hugging Face model)

Videos

DIFFA: Large Language Diffusion Models Can Listen and Understand · Underline

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling