MiDashengLM: Efficient Audio Understanding with General Audio Captions

Heinrich Dinkel; Gang Li; Jizhong Liu; Jian Luan; Yadong Niu; Xingwei Sun; Tianzi Wang; Qiyang Xiao; Junbo Zhang; Jiahao Zhou

arXiv:2508.03983·cs.SD·March 27, 2026

TL;DR

MiDashengLM is an open, efficient audio-language model that learns from general audio captions covering speech, sound, and music. It reports up to 4x faster time-to-first-token and up to 20x higher throughput than comparable models, while relying solely on publicly available datasets.

Contribution

Introduces MiDashengLM, an open audio-language model trained with general audio captions from the new ACAVCaps dataset to handle speech, sound, and music, with an emphasis on transparency and efficiency.

Findings

01. Up to 4x faster time-to-first-token (TTFT) than comparable models

02. Up to 20x higher throughput than comparable models

03. Uses only publicly available datasets for pretraining and supervised fine-tuning
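
The two efficiency figures above are ratios of standard serving metrics. As a rough illustration only, the sketch below measures time-to-first-token and throughput for a single stream of generated tokens. The paper's actual benchmark protocol is not described on this page, so the definitions here (throughput as output tokens per second over one stream) and the helper names `measure_stream` and `speedup` are assumptions, not the authors' method.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, tokens_per_second, count).

    Hypothetical helper: `tokens` stands in for any generator that yields
    decoded tokens as a model produces them.
    """
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            # Record the moment the first token becomes available.
            first_token_at = clock()
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    total = end - start
    throughput = count / total if total > 0 else float("inf")
    return ttft, throughput, count


def speedup(baseline: float, candidate: float) -> float:
    """Ratio used for 'Nx faster' claims on a latency metric (lower is better)."""
    return baseline / candidate
```

Under these assumptions, a "4x faster TTFT" claim corresponds to `speedup(baseline_ttft, candidate_ttft) == 4.0`, while a "20x higher throughput" claim is the candidate's tokens per second divided by the baseline's.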

Abstract

Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through general audio captions, trained using our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works that focus primarily on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions, fusing speech, sound and music information into one…
