Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Marcel Gibier; Nolwenn Celton; Raphaël Duroselle; Pierre Serrano; Olivier Boeffard; Jean-François Bonastre

arXiv:2511.14307 · cs.SD · November 19, 2025

TL;DR

This paper presents an Audio Question Answering system that combines SSL-based audio feature extraction (BEATs), calibrated segment-level acoustic event predictions, and an instruction-tuned language model, reaching 62.6% accuracy on the DCASE 2025 Challenge development set.

Contribution

It introduces a method for AQA that passes calibrated acoustic event predictions to a large language model (Qwen2.5-7B-Instruct) fine-tuned with GRPO.
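
To make the GRPO step concrete, here is a minimal sketch of the two pieces that a "simple reward function" setup typically involves: a per-completion reward and the group-relative advantage computed over several sampled completions for the same prompt. The reward shown (1.0 when the first character of the completion matches the correct choice letter, else 0.0) is an assumption for illustration, not the reward used in the report. The function names `reward` and `group_advantages` are hypothetical, and the population standard deviation is one of several normalization conventions in use.

```python
import statistics


def reward(completion: str, correct_letter: str) -> float:
    """Hypothetical reward: 1.0 if the completion starts with the correct letter."""
    answer = completion.strip().upper()
    return 1.0 if answer[:1] == correct_letter.upper() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    are scored relative to their siblings rather than by a learned value model.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one multiple-choice question whose answer is "B".
completions = ["B", " b) a dog barking", "A", "C"]
rewards = [reward(c, "B") for c in completions]
advantages = group_advantages(rewards)
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no gradient signal, which is why a binary reward still needs some variation across samples to be useful.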

Findings

01

Achieved 62.6% accuracy on the DCASE 2025 Challenge development set.

02

Demonstrated the effectiveness of combining acoustic event predictions with instruction-tuned models.

03

Validated the approach's potential for improved audio question answering performance.

Abstract

In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
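
The path from segment-level predictions to a structured prompt can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the calibration here is a temperature-scaled sigmoid used as a stand-in (the abstract does not name the calibration method), consecutive above-threshold segments are merged into events, and the prompt wording, the 1-second segment duration, the threshold, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    label: str
    start: float  # seconds
    end: float    # seconds


def calibrate(logit, temperature=1.5):
    """Temperature-scaled sigmoid; a stand-in for the report's calibration step."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def segments_to_events(logits, label, seg_dur=1.0, threshold=0.5, temperature=1.5):
    """Merge runs of consecutive above-threshold segments into events."""
    events = []
    start = None  # index of the first segment in the current active run
    for i, logit in enumerate(logits):
        active = calibrate(logit, temperature) >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append(Event(label, start * seg_dur, i * seg_dur))
            start = None
    if start is not None:  # close a run that reaches the end of the clip
        events.append(Event(label, start * seg_dur, len(logits) * seg_dur))
    return events


def build_prompt(events, question, choices):
    """Assemble detected events, the question, and lettered choices into one prompt."""
    lines = ["Detected acoustic events:"]
    for ev in sorted(events, key=lambda e: e.start):
        lines.append(f"- {ev.label}: {ev.start:.1f}s to {ev.end:.1f}s")
    lines.append(f"Question: {question}")
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)


# Illustrative logits for one AudioSet class over five 1-second segments.
dog_events = segments_to_events([-3.0, 2.0, 2.5, -2.0, 3.0], "Dog")
prompt = build_prompt(dog_events, "Which animal is heard?", ["Cat", "Dog", "Bird", "Horse"])
```

In a full system, the per-class logits would come from the classification head on top of the frame-level features, and the resulting prompt would be passed to the fine-tuned language model.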

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling