ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, Tom Goldstein

TL;DR
ARGUS is a benchmark that evaluates Video-LLMs on freeform video captioning, measuring both hallucinations and omissions against human ground-truth captions.
Contribution
The paper introduces ARGUS, a benchmark that assesses hallucination and omission in Video-LLMs during freeform captioning tasks, addressing limitations of existing multiple-choice benchmarks.
Findings
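Video-LLMs hallucinate far more on freeform generation tasks such as video captioning than on multiple-choice verification tasks.
ARGUS quantifies two rates by comparing model captions with human ground-truth captions: hallucinations (incorrect statements about video content or temporal relationships) and omissions (important descriptive details left out).
The benchmark reveals significant gaps in the captioning capabilities of current Video-LLMs.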
Abstract
Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
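The abstract does not spell out how the two rates are computed, so the following is only a minimal sketch of one way dual metrics of this kind could be tallied. It is not the ARGUS scoring procedure. It assumes that each statement in the model caption has already been judged as supported or unsupported by the human caption, and that each detail in the human caption has been judged as covered or missed; the `CaptionJudgment` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionJudgment:
    """Hypothetical per-video annotations; not the ARGUS data format."""

    # One flag per statement in the model's caption:
    # True if the statement is supported by the human caption.
    generated_supported: List[bool]
    # One flag per detail in the human caption:
    # True if the model's caption conveys that detail.
    reference_covered: List[bool]


def hallucination_rate(j: CaptionJudgment) -> float:
    """Fraction of generated statements that are not supported."""
    if not j.generated_supported:
        return 0.0  # assumption: an empty caption hallucinates nothing
    return j.generated_supported.count(False) / len(j.generated_supported)


def omission_rate(j: CaptionJudgment) -> float:
    """Fraction of reference details that the model caption misses."""
    if not j.reference_covered:
        return 0.0  # assumption: nothing to omit if the reference is empty
    return j.reference_covered.count(False) / len(j.reference_covered)


# Illustrative judgments, made up for this sketch.
example = CaptionJudgment(
    generated_supported=[True, False, True, True],
    reference_covered=[True, False],
)
print(hallucination_rate(example))  # 0.25
print(omission_rate(example))       # 0.5
```

Reporting the two numbers side by side matters because each can be minimized trivially on its own: a very short caption rarely hallucinates but omits most details, while a long, speculative caption covers more details at the cost of more unsupported statements.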
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
