Resource-Efficient Reference-Free Evaluation of Audio Captions
Rehana Mahfuz, Yinyi Guo, Erik Visser

TL;DR
This paper introduces lightweight, well-calibrated confidence metrics for evaluating audio captions without references, suitable for resource-limited environments, and analyzes their effectiveness and calibration methods.
Contribution
It presents novel, resource-efficient confidence metrics for reference-free caption evaluation and explores their calibration and alignment with correctness measures.
Findings
Confidence metrics can effectively replace reference-based correctness measures.
Temperature scaling improves the calibration of confidence metrics.
Some confidence metrics align better with specific correctness measures.
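As a rough illustration of the temperature-scaling finding, the sketch below computes one possible confidence score for a generated caption. The page does not specify the paper's actual metrics, so the choice of mean probability of the generated tokens, along with the names `softmax` and `caption_confidence`, is an assumption made purely for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over one token position, with logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def caption_confidence(token_logits, token_ids, temperature=1.0):
    """Hypothetical confidence metric: mean probability of the generated tokens.

    token_logits: one list of vocabulary logits per generated token.
    token_ids: the index of the token actually generated at each position.
    """
    probs = [softmax(logits, temperature)[tok]
             for logits, tok in zip(token_logits, token_ids)]
    return sum(probs) / len(probs)

# Toy example: a two-token caption over a two-word vocabulary.
toy_logits = [[2.0, 0.0], [0.0, 2.0]]
toy_ids = [0, 1]
conf_t1 = caption_confidence(toy_logits, toy_ids, temperature=1.0)
conf_t2 = caption_confidence(toy_logits, toy_ids, temperature=2.0)
```

A temperature above 1 flattens the token distributions and lowers the score, which is the usual way to correct an overconfident model. In practice the temperature would be fitted on held-out data rather than chosen by hand.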
Abstract
To establish the trustworthiness of systems that automatically generate text captions for audio, images and video, existing reference-free metrics rely on large pretrained models which are impractical to accommodate in resource-constrained settings. To address this, we propose some metrics to elicit the model's confidence in its own generation. To assess how well these metrics replace correctness measures that leverage reference captions, we test their calibration with correctness measures. We discuss why some of these confidence metrics align better with certain correctness measures. Further, we provide insight into why temperature scaling of confidence metrics is effective. Our main contribution is a suite of well-calibrated lightweight confidence metrics for reference-free evaluation of captions in resource-constrained settings.
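One common way to test calibration between a confidence metric and a correctness measure is a binned expected calibration error. The sketch below is a minimal version under two assumptions not stated in the abstract: that both confidence and correctness scores lie in [0, 1], and that equal-width binning is used. The function name `expected_calibration_error` is illustrative and not taken from the paper.

```python
def expected_calibration_error(confidences, correctness, n_bins=10):
    """Weighted mean gap between average confidence and average correctness per bin.

    confidences: per-caption confidence scores, assumed to lie in [0, 1].
    correctness: per-caption reference-based scores, assumed to lie in [0, 1].
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, corr in zip(confidences, correctness):
        # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, corr))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_corr = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - avg_corr)
    return ece

# Toy data: the first pair of captions is well calibrated, the second overconfident.
ece_good = expected_calibration_error([0.9, 0.9], [0.9, 0.9])
ece_bad = expected_calibration_error([0.95, 0.95], [0.5, 0.5])
```

A lower value means the confidence metric tracks the correctness measure more closely, so a metric with a low error could stand in for the reference-based score when no reference captions are available.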
Taxonomy
Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Speech Recognition and Synthesis
