VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding

Xueqing Yu; Bohan Li; Yan Li; Zhenheng Yang

arXiv:2603.07071 · cs.CV · March 11, 2026

Open Access

TL;DR

VirtueBench is a new benchmark for evaluating the trustworthiness of vision-language models in long video understanding. It tests whether a model refuses to answer when the frames it receives do not contain the information the question requires, rather than guessing.

Contribution

The paper introduces VirtueBench, a benchmark that assesses model trustworthiness under uncertainty. It constructs multiple frame-sampling levels for each video and provides ground truths that distinguish answerable from unanswerable cases.
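As a rough illustration of why answerability can change with the frame-sampling level, the sketch below samples frames uniformly at several budgets and checks whether any sampled frame falls inside an annotated key span. This is not the paper's construction procedure: the uniform sampler, the `key_span` annotation, and the "at least one key frame sampled" rule are all assumptions made for the example.

```python
def uniform_sample(total_frames: int, num_samples: int) -> list[int]:
    """Return frame indices spread evenly over the video (assumed sampler)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the centre of each equal-width segment of the video.
    return [int(step * i + step / 2) for i in range(num_samples)]


def is_answerable(sampled: list[int], key_span: tuple[int, int]) -> bool:
    """Hypothetical rule: answerable if any sampled frame lies in the key span."""
    start, end = key_span
    return any(start <= idx <= end for idx in sampled)


def answerability_by_level(
    total_frames: int, key_span: tuple[int, int], levels: list[int]
) -> dict[int, bool]:
    """Map each frame budget to an answerable / unanswerable label."""
    return {
        n: is_answerable(uniform_sample(total_frames, n), key_span)
        for n in levels
    }


# A 1000-frame video whose (hypothetical) key evidence sits in frames 400-420.
labels = answerability_by_level(1000, (400, 420), [4, 16, 64])
print(labels)
```

Under these assumptions the same question is unanswerable with 4 frames but answerable with 16 or 64, which is the kind of per-level ground truth the contribution describes.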

Findings

01. Refusal accuracy varies significantly across models.

02. Most models tend to answer even when uncertain.

03. The benchmark reveals models' refusal behaviors and reliability issues.

Abstract

Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model's input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases.…
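The scoring problem described in the abstract can be made concrete with a small sketch. It compares plain accuracy, where a refusal always counts as wrong, with an answerability-aware score that credits a refusal on unanswerable items. The record layout, the `REFUSE` token, and the `trust_aware_accuracy` rule are assumptions for illustration and are not the metric defined in the paper.

```python
REFUSE = "REFUSE"  # hypothetical token a model emits when it declines to answer


def plain_accuracy(records: list[dict]) -> float:
    """Conventional accuracy: a refusal never matches the answer, so it is wrong."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    return correct / len(records)


def trust_aware_accuracy(records: list[dict]) -> float:
    """Assumed scoring rule: answerable items need the right answer,
    unanswerable items need a refusal."""
    if not records:
        return 0.0
    correct = 0
    for r in records:
        target = r["answer"] if r["answerable"] else REFUSE
        correct += r["prediction"] == target
    return correct / len(records)


def make_records(predictions: list[str]) -> list[dict]:
    """Attach predictions to four toy questions; the last two lack key frames."""
    answers = ["A", "B", "D", "A"]
    answerable = [True, True, False, False]
    return [
        {"answer": a, "answerable": ok, "prediction": p}
        for a, ok, p in zip(answers, answerable, predictions)
    ]


guesser = make_records(["A", "C", "D", "B"])       # always guesses, one lucky hit
honest = make_records(["A", "C", REFUSE, REFUSE])  # refuses when frames are missing

print(plain_accuracy(guesser), plain_accuracy(honest))
print(trust_aware_accuracy(guesser), trust_aware_accuracy(honest))
```

On this toy data the guessing model scores higher under plain accuracy (0.5 against 0.25), while the answerability-aware score reverses the ranking (0.25 against 0.75).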

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning