Trust but Verify: Programmatic VLM Evaluation in the Wild
Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu

TL;DR
This paper introduces PROVE, a new benchmark for evaluating vision-language models' responses by verifying their claims against detailed scene graphs, addressing hallucination issues in open-ended visual queries.
Contribution
We propose a novel benchmarking paradigm that uses scene graphs and programmatic verification to assess the helpfulness and truthfulness of VLM responses.
Findings
Few VLMs balance helpfulness and truthfulness effectively.
PROVE provides a challenging dataset of 10.5k QA pairs.
Programmatic evaluation correlates well with human judgment.
Abstract
Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a…
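To make the idea of "programs that can be executed over the scene graph object to verify each QA pair" more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `SceneGraph` class, its methods, the `verify_qa` helper, and the example entities are all hypothetical names chosen for illustration, and the exact-match comparison is an assumed simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): entity attributes plus relation triples."""
    attributes: Dict[str, Set[str]] = field(default_factory=dict)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def has_attribute(self, entity: str, attribute: str) -> bool:
        # True if the entity exists and carries the given attribute.
        return attribute in self.attributes.get(entity, set())

    def related(self, subject: str, relation: str) -> List[str]:
        # All objects o such that (subject, relation, o) is in the graph.
        return sorted(o for s, r, o in self.relations
                      if s == subject and r == relation)


def build_example_graph() -> SceneGraph:
    # Assumed example content, standing in for a graph derived from a caption.
    return SceneGraph(
        attributes={"dog": {"brown", "small"}, "bench": {"wooden"}},
        relations={("dog", "sitting on", "bench")},
    )


def verify_qa(graph: SceneGraph,
              program: Callable[[SceneGraph], str],
              expected_answer: str) -> bool:
    """Run a verification program over the graph and compare with the answer.

    A program that raises (for example, because it refers to something the
    graph does not contain) counts as a failed verification.
    """
    try:
        return program(graph) == expected_answer
    except Exception:
        return False


# Hypothetical program for the question "What is the dog sitting on?"
def what_is_dog_sitting_on(graph: SceneGraph) -> str:
    return graph.related("dog", "sitting on")[0]


# Hypothetical program that refers to a relation absent from the graph.
def what_is_dog_chasing(graph: SceneGraph) -> str:
    return graph.related("dog", "chasing")[0]


graph = build_example_graph()
grounded = verify_qa(graph, what_is_dog_sitting_on, "bench")
ungrounded = verify_qa(graph, what_is_dog_chasing, "ball")
```

In this sketch, a QA pair would be kept only if its program runs and reproduces the stated answer, which is one plausible reading of how such programs can filter out pairs that are not grounded in the scene graph.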
Peer Reviews
Decision: Submitted to ICLR 2025
I do like how this method can evaluate VLMs compositionally with a prepared set of programs, rather than all at once with an LLM. The design of the helpfulness and truthfulness metrics is interesting. It is interesting to find a way to evaluate hallucination in VLMs.
1. The presentation is poor. It took me a while to realize that this paper presents a way to evaluate image captioning by breaking the captioning task into VQA tasks, where the answers are evaluated by a program generated by GPT from the gold scene graph, rather than by GPT from the gold caption. 2. The results in Table 1 and the example outputs in Figure 5 are very confusing. Why does the performance of all models look similar? From my personal experience, GPT-4o should be
- The paper is well-motivated and tackles an important research problem in VLM evaluation. The inclusion of truthfulness in addition to helpfulness is thoughtful and often neglected. - The paper is generally well-written, with clear definitions of the helpfulness and truthfulness metrics and helpful illustrations such as Figure 4. - The evaluation covers a broad range of models. - The authors perform multiple data filtering steps to ensure the correctness of the programs and high quality of the
- The reviewer is mostly concerned about the use of models in multiple parts of the dataset generation, filtering, and evaluation pipeline, especially in extracting the scene graphs from captions. - For example, the scene graphs are not guaranteed to be completely accurate, as they are automatically extracted from the captions in the DOCCI dataset by an LLM without any human verification or filtering. - Similarly, as the authors mentioned, the sentence BERT model and visual entailment model OF
1. The definition of the two metrics, i.e., helpfulness and truthfulness, based on the scene graphs, is interesting. 2. The writing is clear and easy to follow.
1. The generalizability of the proposed evaluation paradigm is limited. The evaluation requires a dense scene graph and an executable program, which limits its use to the proposed dataset. The evaluation can hardly be generalized to images or questions without detailed annotations. Moreover, the evaluation’s effectiveness is bounded by the quality of the dense scene graph and detailed caption. Anything that is not in the scene graph cannot be evaluated. This is not exactly an “in-the-wild” evaluation
