OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models
Monika Wysoczańska, Shyamal Buch, Anurag Arnab, Cordelia Schmid

TL;DR
OVFact is a reference-free metric for the factual accuracy of long captions generated by vision-language models. It combines open-vocabulary visual grounding with tool-based verification, agrees more closely with human judgments, and can be used to filter training data.
Contribution
The paper presents OV-Fact, an open-vocabulary, reference-free factuality metric for long captions. It agrees more closely with human judgments and, because it does not depend on human-annotated ground-truth captions, it can also be used to filter training data.
Findings
OV-Fact aligns better with human judgments of factuality.
Filtering training data with OV-Fact improves model factuality.
A single metric captures both caption descriptiveness (recall) and factual precision.
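As a rough illustration of how one score can reflect both precision and recall and then drive data filtering, here is a minimal Python sketch. It is not the paper's formula. The harmonic-mean combination, the entity counts, the field names, and the threshold are all assumptions made for this example; in practice the counts would come from a grounding and verification pipeline that is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredCaption:
    """Hypothetical per-caption counts produced by some verification step."""
    caption: str
    n_claimed: int   # entities the caption mentions
    n_verified: int  # mentioned entities confirmed as present in the image
    n_visible: int   # entities found in the image (proxy for what could be described)


def precision(c: ScoredCaption) -> float:
    # Fraction of mentioned entities that were verified.
    return c.n_verified / c.n_claimed if c.n_claimed else 0.0


def recall(c: ScoredCaption) -> float:
    # Fraction of visible entities that the caption correctly covers.
    return c.n_verified / c.n_visible if c.n_visible else 0.0


def combined_score(c: ScoredCaption) -> float:
    # Assumption: harmonic mean of precision and recall (an F1-style score).
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def filter_captions(captions: List[ScoredCaption], threshold: float) -> List[ScoredCaption]:
    # Keep only captions whose combined score reaches the threshold.
    return [c for c in captions if combined_score(c) >= threshold]


if __name__ == "__main__":
    data = [
        ScoredCaption("detailed and accurate", n_claimed=4, n_verified=4, n_visible=8),
        ScoredCaption("mostly hallucinated", n_claimed=5, n_verified=1, n_visible=4),
    ]
    for c in filter_captions(data, threshold=0.5):
        print(c.caption, round(combined_score(c), 3))
```

A caption that mentions few entities but gets them all right scores high on precision and low on recall, so a combined score of this kind penalizes short, safe captions as well as hallucinated ones. No ground-truth caption appears anywhere in the computation, which is what makes filtering of unlabeled data possible.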
Abstract
Large vision-language models (VLMs) often struggle to generate long and factual captions. However, traditional measures for hallucination and factuality are not well suited for evaluating longer, more diverse captions and in settings where ground-truth human-annotated captions are unavailable. We introduce OV-Fact, a novel method for measuring caption factuality of long captions that leverages open-vocabulary visual grounding and tool-based verification without depending on human annotations. Our method improves agreement with human judgments and captures both caption descriptiveness (recall) and factual precision in the same metric. Furthermore, unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering. We observe models trained on an OVFact-filtered (2.5-5x less) subset of a large-scale, noisy (VLM-generated) pretraining…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Subtitles and Audiovisual Media
