DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li

TL;DR
DIFFA introduces a diffusion-based large audio-language model that effectively understands spoken language, demonstrating competitive performance with limited training data and opening new avenues for speech-driven AI.
Contribution
This work presents the first diffusion-based large audio-language model for spoken language understanding. It combines a frozen diffusion language model with a lightweight dual-adapter architecture and trains it with a two-stage pipeline.
Findings
DIFFA performs competitively on major benchmarks, outperforming several autoregressive baselines.
It reaches this performance with only 960 hours of ASR data and 127 hours of synthetic instruction data.
The results demonstrate the potential of diffusion language models for scalable audio understanding.
Abstract
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs…
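The training setup described in the abstract can be sketched as a small configuration. This is an illustrative sketch only, not the authors' implementation: the component names (`diffusion_lm`, `adapter_1`, `adapter_2`), the stage names, and the choice to update both adapters in both stages are assumptions, since the summary does not say which adapter is trained in which stage. The data sizes are taken from the Findings above and are mapped to the two stages on the assumption that the ASR data serves the first stage and the synthetic instruction data the second.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str        # hypothetical name, for illustration only
    trainable: bool  # False means the weights stay frozen


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str
    data_hours: int


def build_components():
    # The diffusion language model is frozen; only the adapters are updated.
    # Assumption: both adapters are trainable in both stages.
    return [
        Component("diffusion_lm", trainable=False),
        Component("adapter_1", trainable=True),
        Component("adapter_2", trainable=True),
    ]


# Stage 1 aligns speech with text via an ASR objective;
# stage 2 teaches instruction following on synthetic data.
STAGES = [
    Stage("stage_1_alignment", "asr", 960),
    Stage("stage_2_instruction", "instruction_following", 127),
]


def trainable_names(components):
    # Names of the components whose weights would receive gradient updates.
    return [c.name for c in components if c.trainable]


def training_plan(components, stages):
    # One entry per stage: what is optimized, on how much data, and which parts move.
    return [
        {
            "stage": s.name,
            "objective": s.objective,
            "data_hours": s.data_hours,
            "updates": trainable_names(components),
        }
        for s in stages
    ]


if __name__ == "__main__":
    for step in training_plan(build_components(), STAGES):
        print(step)
```

In a real training framework the `trainable` flag would correspond to disabling gradient updates for the backbone parameters while leaving the adapter parameters free.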
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
