ARGUS: Hallucination and Omission Evaluation in Video-LLMs

Ruchit Rawal; Reza Shirkavand; Heng Huang; Gowthami Somepalli; Tom Goldstein

arXiv:2506.07371 · cs.CV · June 11, 2025

TL;DR

ARGUS is a benchmark that evaluates Video-LLMs on freeform video captioning by measuring two failure modes: hallucinations and omissions.

Contribution

The paper introduces ARGUS, a benchmark that assesses hallucination and omission in Video-LLMs during freeform captioning tasks, addressing limitations of existing multiple-choice benchmarks.

Findings

01  Video-LLMs hallucinate more on freeform generation tasks than on multiple-choice tasks.

02  ARGUS quantifies both hallucination and omission rates.

03  The benchmark reveals significant gaps in current Video-LLM capabilities.

Abstract

Video large language models (Video-LLMs) have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, Video-LLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple-choice verification tasks. To address this weakness, we propose ARGUS, a Video-LLM benchmark that measures freeform video captioning performance. By comparing Video-LLM outputs to human ground-truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
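The dual metrics described in the abstract can be illustrated with a toy calculation. The sketch below is not the ARGUS scoring procedure, which this page does not specify. It assumes captions have already been split into individual statements, and it uses a hypothetical `toy_judge` function (exact string match) as a placeholder for whatever judgment decides that one statement is supported by another. All function and variable names here are illustrative.

```python
from typing import Callable, List

# A judge answers: is `claim` supported by `evidence`?
Judge = Callable[[str, str], bool]


def toy_judge(claim: str, evidence: str) -> bool:
    """Hypothetical stand-in judge: exact match after trimming and lowercasing."""
    return claim.strip().lower() == evidence.strip().lower()


def hallucination_rate(generated: List[str], reference: List[str],
                       judge: Judge = toy_judge) -> float:
    """Fraction of generated statements that no reference statement supports."""
    if not generated:
        return 0.0
    unsupported = sum(
        1 for g in generated if not any(judge(g, r) for r in reference)
    )
    return unsupported / len(generated)


def omission_rate(generated: List[str], reference: List[str],
                  judge: Judge = toy_judge) -> float:
    """Fraction of reference statements that no generated statement covers."""
    if not reference:
        return 0.0
    missed = sum(
        1 for r in reference if not any(judge(r, g) for g in generated)
    )
    return missed / len(reference)


# Made-up example data, not taken from the ARGUS dataset.
reference = [
    "a man opens the door",
    "a dog runs outside",
    "the man closes the door",
]
generated = [
    "a man opens the door",
    "a cat jumps on the table",
]

h = hallucination_rate(generated, reference)
o = omission_rate(generated, reference)
print(f"hallucination rate: {h:.2f}, omission rate: {o:.2f}")
```

In this toy example, one of the two generated statements has no support in the reference, and two of the three reference statements are never mentioned. The sketch ignores temporal relationships between events, which the abstract lists as a separate source of hallucination, so an order-aware judge would be needed to capture that case.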

Code & Models

Datasets

tomg-group-umd/argus
dataset · 1.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling