Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

TL;DR
This paper presents an approach to Audio Question Answering that combines SSL-based audio feature extraction, calibrated segment-level predictions, and an instruction-tuned language model, achieving 62.6% accuracy on the DCASE 2025 Challenge development set.
Contribution
It introduces a new method integrating acoustic event reasoning with large language models using GRPO-based fine-tuning for AQA.
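To make the GRPO step more concrete, here is a minimal sketch of the group-relative advantage that GRPO computes over several sampled answers to the same prompt. The summary only says the reward is "simple", so the exact-match reward below, the function names, and the use of the population standard deviation are all assumptions made for illustration, not the authors' implementation.

```python
import statistics


def simple_reward(completion: str, correct_answer: str) -> float:
    """Hypothetical reward: 1.0 when the completion matches the answer, else 0.0."""
    return 1.0 if completion.strip().lower() == correct_answer.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group.

    GRPO scores every sampled completion relative to the other completions
    drawn for the same prompt, so no separate value model is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one multiple-choice question whose correct answer is "B".
samples = ["B", "A", "C", "b"]
rewards = [simple_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.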
Findings
Achieved 62.6% accuracy on the DCASE 2025 Challenge development set.
Demonstrated the effectiveness of combining acoustic event predictions with instruction-tuned models.
Validated the approach's potential for improved audio question answering performance.
Abstract
In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
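The path from segment-level scores to a structured prompt can be sketched as follows. The abstract does not specify the calibration method, the segment length, the decision threshold, or the prompt wording, so the temperature-scaled sigmoid, the 1-second segments, the 0.5 threshold, and every function name and prompt string below are hypothetical choices for illustration only.

```python
import math
import string


def calibrate(logit: float, temperature: float = 1.5, bias: float = 0.0) -> float:
    """Assumed calibration: sigmoid of a temperature-scaled, shifted logit."""
    return 1.0 / (1.0 + math.exp(-(logit / temperature + bias)))


def segments_to_events(probs, label, segment_sec=1.0, threshold=0.5):
    """Merge consecutive segments at or above the threshold into (label, onset, offset)."""
    events, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # an event begins at this segment
        elif p < threshold and start is not None:
            events.append((label, start * segment_sec, i * segment_sec))
            start = None
    if start is not None:  # event still active at the end of the clip
        events.append((label, start * segment_sec, len(probs) * segment_sec))
    return events


def build_prompt(events, question, choices):
    """Assemble detected events, the question, and the candidate answers into one prompt."""
    lines = ["Detected acoustic events:"]
    lines += [f"- {label}: {onset:.1f}s to {offset:.1f}s" for label, onset, offset in events]
    lines.append(f"Question: {question}")
    lines.append("Candidate answers:")
    lines += [f"{letter}. {choice}" for letter, choice in zip(string.ascii_uppercase, choices)]
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)


# Hypothetical segment-level logits for one AudioSet class over a 5-second clip.
dog_logits = [2.0, 3.0, -2.0, -3.0, 1.5]
dog_probs = [calibrate(x) for x in dog_logits]
events = segments_to_events(dog_probs, "Dog")
prompt = build_prompt(events, "Which animal is heard?", ["Cat", "Dog", "Bird", "Horse"])
```

In this sketch the language model never sees the audio itself; it reasons only over the textual event list, which is why well-calibrated segment scores matter before thresholding.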
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling
