VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
Zhe Hu, Yixiao Ren, Jing Li, Yu Yin

TL;DR
VIVA is a new benchmark designed to evaluate vision-language models' ability to incorporate human values into decision-making in real-world scenarios, highlighting current limitations and potential improvements.
Contribution
This work introduces VIVA, the first benchmark for assessing multimodal decision-making with human values in vision-language models.
Findings
VLMs show limited ability to use human values in decision-making.
Exploiting action consequences improves decision accuracy.
Predicted human values can enhance model performance.
Abstract
Large vision-language models (VLMs) have demonstrated significant potential for integration into daily life, making it crucial for them to incorporate human values when making decisions in real-world situations. This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large VLMs focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,240 images depicting diverse real-world situations, along with manually annotated decisions grounded in them. Given an image from the benchmark, the model should select the most appropriate action to address the situation and provide the relevant human values and the reasoning underlying the decision. Extensive experiments based on VIVA show the limitations of VLMs in using human values to make…
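As a rough illustration of the task format described in the abstract, the sketch below models one item as an image with a set of candidate actions and scores action selection by plain accuracy. The class and function names, the field layout, and the use of accuracy as the metric are assumptions made for illustration; they are not the benchmark's actual data schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One hypothetical benchmark item (field names are assumptions)."""
    image_path: str      # path to the image depicting the situation
    actions: List[str]   # candidate actions the model chooses from
    gold_index: int      # index of the annotated most appropriate action


def action_accuracy(samples: Sequence[Sample], predicted: Sequence[int]) -> float:
    """Return the fraction of samples whose predicted action index matches the annotation."""
    if len(samples) != len(predicted):
        raise ValueError("samples and predicted must have the same length")
    if not samples:
        return 0.0
    correct = sum(1 for s, p in zip(samples, predicted) if s.gold_index == p)
    return correct / len(samples)
```

A model under test would map each image and its candidate actions to one index; the value and reason outputs mentioned in the abstract would need their own scoring, which this sketch does not cover.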
Taxonomy
Topics: Complex Systems and Decision Making
Methods: Focus
