ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

Šimon Sedláček; Sara Barahona; Bolaji Yusuf; Laura Herrera-Alarcón; Santosh Kesiraju; Cecilia Bolaños; Alicia Lozano-Diez; Sathvik Udupa; Fernando López; Allison Ferner; Ramani Duraiswami; Jan Černocký

arXiv:2512.09066 · cs.SD · December 11, 2025

Open Access

TL;DR

ORCA introduces a novel framework for assessing open-ended audio question answering responses by modeling human judgment variability with Beta distributions, improving accuracy and efficiency over existing methods.

Contribution

The paper presents ORCA, a new approach that captures human judgment uncertainty in audio QA evaluation, combining structured annotation with probabilistic modeling for better benchmarking.

Findings

01. ORCA achieves 0.91 correlation with human judgments.

02. It provides uncertainty estimates alongside correctness scores.

03. It requires less compute than traditional LLM-based evaluators.

Abstract

Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha). ORCA achieves…
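
To illustrate why a Beta distribution can carry both an expected correctness and an uncertainty, the sketch below summarizes a set of binary annotator votes as a Beta distribution and reads off its mean and variance. This is not the ORCA model, which predicts the distribution from the question-answer pair; the names BetaJudgment and beta_from_votes, the binary vote format, and the uniform Beta(1, 1) prior are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BetaJudgment:
    """Hypothetical container for a Beta(alpha, beta) over correctness."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Expected correctness: E[p] = alpha / (alpha + beta)
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Var[p] = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total ** 2 * (total + 1.0))


def beta_from_votes(
    votes: Sequence[bool],
    prior_alpha: float = 1.0,  # assumption: uniform Beta(1, 1) prior
    prior_beta: float = 1.0,
) -> BetaJudgment:
    """Summarize binary correct/incorrect votes as a Beta posterior."""
    correct = sum(1 for v in votes if v)
    incorrect = len(votes) - correct
    return BetaJudgment(prior_alpha + correct, prior_beta + incorrect)


# Annotators who agree vs. annotators who genuinely disagree.
unanimous = beta_from_votes([True, True, True, True])
split = beta_from_votes([True, False, True, False])

print(f"unanimous: mean={unanimous.mean:.3f}, var={unanimous.variance:.4f}")
print(f"split:     mean={split.mean:.3f}, var={split.variance:.4f}")
```

With these toy votes, the split case has a mean of 0.5 and a larger variance than the unanimous case, which is the kind of disagreement a mean-only score hides.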

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling