Simple Token-Level Confidence Improves Caption Correctness
Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

TL;DR
This paper introduces Token-Level Confidence (TLC), a simple method for judging whether a caption correctly describes an image. TLC improves accuracy on fine-grained vision-language understanding and reduces object hallucinations in generated captions.
Contribution
The paper proposes TLC, a token-level confidence measure that improves caption correctness assessment and outperforms sequence-level scores from pretrained models on vision-language tasks such as verb understanding.
Findings
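TLC with algebraic confidence measures achieves a 10% relative improvement in accuracy on verb understanding in SVO-Probes, compared to sequence-level scores from pretrained models.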
TLC reduces object hallucination rates by 30%.
Learned confidence estimators further enhance performance.
Abstract
The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and…
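The aggregation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the "algebraic" confidence of a token is the softmax probability the captioning model assigns to it at its decoding step, and the aggregation choices (`min`, `mean`, `prod`) and all function names are hypothetical. The toy logits stand in for the output of a real vision-language model that has been given an image and a proposed caption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidences(step_logits, token_ids):
    """Probability assigned to each caption token at its own step.

    step_logits: one list of vocabulary logits per caption token
                 (assumed to come from scoring the proposed caption).
    token_ids:   the vocabulary index of each caption token.
    """
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]

def aggregate(confidences, how="min"):
    """Combine token confidences into one score for a word or sequence.

    The set of aggregation functions here is an assumption for illustration.
    """
    if how == "min":
        return min(confidences)
    if how == "mean":
        return sum(confidences) / len(confidences)
    if how == "prod":
        return math.prod(confidences)
    raise ValueError(f"unknown aggregation: {how}")

# Toy example: a 3-word vocabulary and a 2-token caption.
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
token_ids = [0, 2]
confs = token_confidences(step_logits, token_ids)
score = aggregate(confs, how="min")
print(confs, score)
```

Under these assumptions, a low aggregated score flags a caption (or a single word within it) as likely inconsistent with the image, and a threshold on the score turns it into a correctness decision.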
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Token-Level Confidence (TLC)
