Semantic-Aware Confidence Calibration for Automated Audio Captioning
Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli, Venkata Krishna Rayalu Garapati

TL;DR
This paper introduces a semantic-aware confidence calibration framework for automated audio captioning, improving the reliability of confidence estimates by integrating semantic similarity measures and confidence prediction into the model.
Contribution
It presents a novel approach that combines confidence prediction with semantic evaluation, enhancing calibration and caption quality in audio captioning models.
Findings
Significantly improved calibration (ECE of 0.071) over baselines (ECE of 0.488)
Enhanced caption quality across standard metrics
Semantic similarity measures outperform n-gram metrics for confidence calibration
Abstract
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text…
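To make the calibration measure concrete, here is a minimal sketch of how ECE could be computed once correctness is defined by semantic similarity instead of n-gram overlap. It is not the authors' implementation. The 0.5 similarity threshold, the ten equal-width bins, and the function names `semantic_correctness` and `expected_calibration_error` are all assumptions made for illustration. The similarity scores stand in for whatever CLAP or FENSE would return for each caption.

```python
from typing import List, Sequence


def semantic_correctness(similarities: Sequence[float],
                         threshold: float = 0.5) -> List[int]:
    """Mark a caption correct (1) when its semantic similarity score
    reaches the threshold. The 0.5 default is a hypothetical choice."""
    return [1 if s >= threshold else 0 for s in similarities]


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins: the sample-weighted average
    of |accuracy - mean confidence| over the non-empty bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += y
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue
        gap = abs(acc_sum[k] / count[k] - conf_sum[k] / count[k])
        ece += (count[k] / n) * gap
    return ece


if __name__ == "__main__":
    # Made-up values: four captions, each predicted with confidence 0.9.
    confidences = [0.9, 0.9, 0.9, 0.9]
    similarities = [0.8, 0.7, 0.2, 0.9]
    labels = semantic_correctness(similarities)
    print(labels, expected_calibration_error(confidences, labels))
```

In this toy input three of the four captions pass the threshold, so the single occupied bin has accuracy 0.75 against a mean confidence of 0.9. That gap is the kind of overconfidence ECE is meant to expose.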
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis
