Trust but Verify: Programmatic VLM Evaluation in the Wild

Viraj Prabhu; Senthil Purushwalkam; An Yan; Caiming Xiong; Ran Xu

arXiv:2410.13121·cs.CV·October 18, 2024

TL;DR

This paper introduces PROVE, a new benchmark for evaluating vision-language models' responses by verifying their claims against detailed scene graphs, addressing hallucination issues in open-ended visual queries.

Contribution

We propose a novel benchmarking paradigm that uses scene graphs and programmatic verification to assess the helpfulness and truthfulness of VLM responses.

Findings

01

Few VLMs balance helpfulness and truthfulness effectively.

02

PROVE provides a challenging dataset of 10.5k QA pairs.

03

Programmatic evaluation correlates well with human judgment.

Abstract

Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a…
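The idea of executing a program over a scene graph to verify a QA pair can be illustrated with a minimal sketch. Everything named below (the `SceneGraph` class, its methods, the example entities, and `verification_program`) is a hypothetical stand-in chosen for illustration. It is not the PROVE API or the authors' implementation, and it does not cover how free-form responses are scored.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): entity attributes plus relation triples."""
    attributes: Dict[str, Set[str]] = field(default_factory=dict)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_entity(self, name: str, *attrs: str) -> None:
        # Register an entity and attach any attributes to it.
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subj: str, rel: str, obj: str) -> None:
        # Store a (subject, relation, object) triple.
        self.relations.add((subj, rel, obj))

    def entities_with(self, attr: str) -> List[str]:
        # All entities that carry the given attribute, in sorted order.
        return sorted(e for e, a in self.attributes.items() if attr in a)

    def related(self, subj: str, rel: str) -> List[str]:
        # All objects linked to `subj` through relation `rel`, in sorted order.
        return sorted(o for s, r, o in self.relations if s == subj and r == rel)


def build_example_graph() -> SceneGraph:
    # Assumed toy caption: "A small brown dog sits next to a wooden bench;
    # a red ball lies under the bench."
    g = SceneGraph()
    g.add_entity("dog", "brown", "small")
    g.add_entity("ball", "red")
    g.add_entity("bench", "wooden")
    g.add_relation("dog", "next to", "bench")
    g.add_relation("ball", "under", "bench")
    return g


def verification_program(graph: SceneGraph) -> Optional[str]:
    # Hypothetical program paired with the question "What is the dog next to?"
    objs = graph.related("dog", "next to")
    return objs[0] if objs else None


def verify_qa_pair(
    graph: SceneGraph,
    program: Callable[[SceneGraph], Optional[str]],
    answer: str,
) -> bool:
    # A QA pair is kept only if running its program reproduces the stated answer.
    return program(graph) == answer
```

In this sketch, a generated QA pair such as ("What is the dog next to?", "bench") would pass `verify_qa_pair`, while a pair whose answer disagrees with the graph would be filtered out.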

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

I do like how this method can evaluate VLMs compositionally with a prepared set of programs, instead of as a whole with an LLM. The design of helpfulness and truthfulness is interesting. It is interesting to find a way to evaluate hallucination in VLMs.

Weaknesses

1. The presentation is poor. It took me a while to realize that this paper presents a way to evaluate image captioning by breaking the captioning task into VQA tasks, where the answers are evaluated by a program generated by GPT based on the gold scene graph, instead of by GPT based on the gold caption.
2. The results in Table 1 and the example outputs in Figure 5 are very confusing. Why does the performance of all models look similar? From my personal experience, GPT-4o should be

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The paper is well-motivated and tackles an important research problem in VLM evaluation. The inclusion of truthfulness in addition to helpfulness is thoughtful and often neglected.
- The paper is generally well-written, with clear definitions of the helpfulness and truthfulness metrics and helpful illustrations like Figure 4.
- The evaluation covers a broad range of models.
- The authors perform multiple data filtering steps to ensure the correctness of the programs and high quality of the

Weaknesses

- The reviewer is mostly concerned about the use of models in multiple parts of the dataset generation, filtering, and evaluation pipeline, especially in extracting the scene graphs from captions.
- For example, the scene graphs are not guaranteed to be completely accurate, as they are automatically extracted from the captions in the DOCCI dataset by an LLM without any human verification or filtering.
- Similarly, as the authors mentioned, the sentence BERT model and visual entailment model OF

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The definition of the two metrics, i.e. helpfulness and truthfulness, based on the scene graphs, is interesting.
2. The writing is clear and easy to follow.

Weaknesses

1. Generalizability of the proposed evaluation paradigm is limited. The evaluation requires a dense scene graph and an executable program, which limits its usage to only the proposed dataset. The evaluation can hardly be generalized to images/questions without detailed annotations. Moreover, the evaluation’s effectiveness is bounded by the quality of the dense scene graph/detailed caption. Anything that is not in the scene graph cannot be evaluated. This is not exactly an “in-the-wild” evaluation

Code & Models

Datasets

Salesforce/PROVE
dataset· 82 dl
