Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Andrew Rouditchenko; Saurabhchand Bhati; Edson Araujo; Samuel Thomas; Hilde Kuehne; Rogerio Feris; James Glass

arXiv:2505.09439·eess.AS·November 24, 2025

TL;DR

Omni-R1 shows that fine-tuning a multi-modal LLM (Qwen2.5-Omni) on audio question answering with the reinforcement learning method GRPO yields state-of-the-art accuracy on the MMAU and MMAR benchmarks. Surprisingly, fine-tuning on text alone, without audio, also improves audio-based performance.

Contribution

The paper introduces Omni-R1, which fine-tunes the multi-modal LLM Qwen2.5-Omni with GRPO on an audio question answering dataset and achieves state-of-the-art results on the MMAU and MMAR benchmarks.

Findings

01. Omni-R1 achieves state-of-the-art accuracy on the MMAU and MMAR benchmarks.

02. Much of the performance gain from GRPO is attributable to improved text-based reasoning.

03. Fine-tuning on a text-only dataset, without audio, also improves performance on audio-based tasks.

Abstract

We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new state-of-the-art performance on the recent MMAU and MMAR benchmarks. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, both on the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the performance improvement from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery: fine-tuning without audio on a text-only dataset was effective at improving the audio-based performance.
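For readers unfamiliar with GRPO, the core idea can be sketched as follows: several answers are sampled for the same question, each is scored, and every answer's advantage is its reward normalized against the other answers in that group. The sketch below is illustrative only and is not taken from the paper. The function names, the exact-match reward for multiple-choice answers, the use of the population standard deviation, and the `eps` term are all assumptions made for this example.

```python
from statistics import mean, pstdev


def accuracy_reward(predicted_choice, correct_choice):
    # Hypothetical reward: 1.0 for an exact match on the answer choice, else 0.0.
    # The abstract does not state which reward Omni-R1 uses; this is an assumption.
    same = predicted_choice.strip().lower() == correct_choice.strip().lower()
    return 1.0 if same else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: center each reward on the group mean and scale by
    # the group standard deviation. eps avoids division by zero when every
    # answer in the group receives the same reward.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one question whose answer is "B".
sampled_answers = ["B", "A", "B", "C"]
rewards = [accuracy_reward(a, "B") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, reward, advantage in zip(sampled_answers, rewards, advantages):
    print(f"answer={answer} reward={reward:.1f} advantage={advantage:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative advantages, so the policy update favors the better answers within each group without needing a separate value model. Because the reward in this sketch depends only on the generated answer text, the same computation applies whether or not the audio is supplied alongside the question.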

Taxonomy

Topics: Music and Audio Processing