Semantic-Aware Confidence Calibration for Automated Audio Captioning

Lucas Dunker; Sai Akshay Menta; Snigdha Mohana Addepalli; Venkata Krishna Rayalu Garapati

arXiv:2512.10170 · cs.SD · December 12, 2025

Open Access

TL;DR

This paper introduces a semantic-aware confidence calibration framework for automated audio captioning, improving the reliability of confidence estimates by integrating semantic similarity measures and confidence prediction into the model.

Contribution

It presents a novel approach that combines confidence prediction with semantic evaluation, enhancing calibration and caption quality in audio captioning models.

Findings

01. Significantly improved calibration (ECE of 0.071) over baselines (ECE of 0.488)

02. Enhanced caption quality across standard metrics

03. Semantic similarity measures outperform n-gram metrics for confidence calibration

Abstract

Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text…
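The abstract describes computing Expected Calibration Error with correctness defined by semantic similarity rather than n-gram overlap. A minimal sketch of that idea follows, assuming one confidence score and one similarity score (for example a CLAP or FENSE value scaled to [0, 1]) per caption. The function name, the 0.5 threshold, and the ten equal-width bins are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def semantic_ece(confidences, similarities, threshold=0.5, n_bins=10):
    """Expected Calibration Error with similarity-based correctness.

    A caption counts as "correct" when its semantic similarity score
    reaches `threshold` (a hypothetical cutoff chosen for illustration).
    """
    conf = np.asarray(confidences, dtype=float)
    sims = np.asarray(similarities, dtype=float)
    correct = (sims >= threshold).astype(float)

    # Assign each confidence to one of n_bins equal-width bins on [0, 1].
    # Bins are right-inclusive; a confidence of exactly 0 goes to the first bin.
    bin_idx = np.clip(np.ceil(conf * n_bins).astype(int) - 1, 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if not in_bin.any():
            continue
        # Gap between mean confidence and empirical accuracy in this bin,
        # weighted by the fraction of samples that fall in the bin.
        gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
        ece += in_bin.mean() * gap
    return float(ece)


# Toy, made-up scores: two confident captions and two unconfident ones,
# half of each group semantically correct.
toy_confidences = [0.95, 0.95, 0.15, 0.15]
toy_similarities = [0.90, 0.20, 0.10, 0.80]
toy_ece = semantic_ece(toy_confidences, toy_similarities)
```

On the toy inputs, each occupied bin has an accuracy of 0.5, so the gaps are 0.45 and 0.35 and the weighted sum is 0.4. A model whose confidence matches its rate of semantically correct captions in every bin would score 0.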

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis