SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models
Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui

TL;DR
This paper introduces SoundMind, a new dataset and reinforcement learning method to enhance reasoning abilities in audio-language models, demonstrating significant improvements over existing baselines.
Contribution
It presents a novel dataset and RL algorithm specifically designed for audio logical reasoning, advancing the capabilities of audio-language models.
Findings
Improved reasoning performance on the SoundMind benchmark
Effective fine-tuning of Qwen2.5-Omni-7B with the new dataset
Demonstrated benefits of combining high-quality data with RL techniques
Abstract
While large language models have demonstrated impressive reasoning abilities, their extension to the audio modality, particularly within large audio-language models (LALMs), remains underexplored. Addressing this gap requires a systematic approach that involves a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this work, we present a comprehensive solution for audio logical reasoning (ALR) tasks: we introduce SoundMind, a dataset of 6,446 audio-text annotated samples specifically curated to support complex reasoning. Building on this resource, we propose SoundMind-RL, a rule-based reinforcement learning (RL) algorithm designed to equip audio-language models with robust audio-text reasoning capabilities. By fine-tuning Qwen2.5-Omni-7B on the proposed SoundMind dataset using SoundMind-RL, we achieve strong and consistent improvements…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMusic and Audio Processing · Music Technology and Sound Studies
MethodsBalanced Selection
