Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

TL;DR
Omni-R1 shows that fine-tuning the multi-modal LLM Qwen2.5-Omni on audio question answering with the reinforcement learning method GRPO yields state-of-the-art results on MMAU and MMAR. Surprisingly, fine-tuning on a text-only dataset, without audio, also improves audio-based performance.
Contribution
The paper introduces Omni-R1, which fine-tunes Qwen2.5-Omni with GRPO on an audio question answering dataset and achieves state-of-the-art results on the MMAU and MMAR benchmarks.
Findings
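Omni-R1 achieves the highest accuracy on the MMAU and MMAR benchmarks.
Much of the performance gain from GRPO is attributable to better text-based reasoning, as shown by testing models both with and without audio.
Fine-tuning on a text-only dataset, without audio, also improves audio-based performance.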
Abstract
We propose Omni-R1 which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new State-of-the-Art performance on the recent MMAU and MMAR benchmarks. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, both on the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the performance improvement from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery that fine-tuning without audio on a text-only dataset was effective at improving the audio-based performance.
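The abstract names GRPO as the reinforcement learning method but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage computation that GRPO is generally built around: several answers are sampled for one question, each is scored, and each score is normalized against the group's mean and standard deviation. The exact-match reward, the function names, the use of the population standard deviation, and the sample data are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def accuracy_reward(predicted: str, correct: str) -> float:
    """Hypothetical reward: 1.0 if the sampled choice matches the reference, else 0.0."""
    return 1.0 if predicted.strip().lower() == correct.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group.

    Assumption: population standard deviation plus a small eps so that a group
    with identical rewards yields zero advantages instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group: four sampled answers to one multiple-choice question
# whose reference answer is "B" (made-up data).
sampled_answers = ["B", "A", "B", "C"]
rewards = [accuracy_reward(a, "B") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, correct samples receive positive advantages and incorrect ones negative advantages, so the policy update favors answers that beat the group average. A group in which every sample earns the same reward contributes no learning signal.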
Taxonomy
Topics: Music and Audio Processing
