VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
Xueqing Yu, Bohan Li, Yan Li, Zhenheng Yang

TL;DR
VirtueBench is a benchmark for evaluating the trustworthiness of vision-language models in long video understanding. It focuses on whether models refuse to answer when they are uncertain, with the aim of promoting more reliable AI systems.
Contribution
The paper introduces VirtueBench, a benchmark that assesses model trustworthiness under uncertainty by constructing multiple frame-sampling levels and providing ground truths for answerability.
Findings
Refusal accuracy varies significantly across models.
Most models tend to answer even when uncertain.
The benchmark reveals models' refusal behaviors and reliability issues.
Abstract
Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model's input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases.…
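The scoring problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's metric or code: the `Item` fields, the `REFUSAL` sentinel, and the three reported numbers are assumptions chosen for illustration. It contrasts a naive accuracy, which marks every refusal as wrong, with separate scores for answerable and unanswerable cases.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentinel for "cannot answer from the given frames".
REFUSAL = "REFUSE"


@dataclass
class Item:
    answerable: bool  # assumed ground truth: do the sampled frames contain the needed evidence?
    gold: str         # correct option label for the underlying question
    prediction: str   # model output: an option label or REFUSAL


def _rate(hits: int, total: int) -> Optional[float]:
    """Return hits / total, or None when there is nothing to score."""
    return hits / total if total else None


def score(items: list[Item]) -> dict[str, Optional[float]]:
    answerable = [it for it in items if it.answerable]
    unanswerable = [it for it in items if not it.answerable]
    return {
        # Naive scoring: any refusal is wrong, and a lucky guess counts as correct.
        "naive_accuracy": _rate(sum(it.prediction == it.gold for it in items), len(items)),
        # Answerable cases: the model should give the correct option.
        "answer_accuracy": _rate(sum(it.prediction == it.gold for it in answerable), len(answerable)),
        # Unanswerable cases: the model should refuse.
        "refusal_accuracy": _rate(sum(it.prediction == REFUSAL for it in unanswerable), len(unanswerable)),
    }


if __name__ == "__main__":
    results = score([
        Item(answerable=True, gold="A", prediction="A"),       # correct answer
        Item(answerable=False, gold="C", prediction="C"),      # lucky guess without evidence
        Item(answerable=False, gold="D", prediction=REFUSAL),  # honest refusal
    ])
    print(results)
```

In this toy input, the lucky guess raises the naive accuracy while the honest refusal lowers it. Scoring the unanswerable cases separately reverses that incentive.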
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
