HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Dan Ben-Ami; Gabriele Serussi; Kobi Cohen; Chaim Baskin

arXiv:2512.14870 · cs.CV · April 3, 2026

TL;DR

HERBench is a new VideoQA benchmark whose questions can only be answered by integrating multiple evidence cues from distinct video segments. Current Video-LLMs reach only 31-42% accuracy on it.

Contribution

The paper introduces HERBench, a benchmark with higher evidential demand than prior benchmarks, and a novel metric, the Minimum Required Frame-Set (MRFS), to measure how much multi-evidence integration a VideoQA question requires.

Findings

01. Current Video-LLMs achieve only 31-42% accuracy on HERBench.

02. HERBench imposes higher evidential demand than previous benchmarks.

03. Two main bottlenecks are identified: retrieval deficits and fusion deficits.
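To put the accuracy range in context, the short sketch below rescales it against random guessing. It relies on the abstract's statement that questions are five-way multiple choice, so chance is 20%. The function names and the chance-normalization formula are illustrative choices, not something the paper reports.

```python
def chance_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    return 1.0 / num_options


def chance_normalized(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that 0.0 means chance level and 1.0 means perfect.

    This normalization is an illustrative convention (an assumption here),
    not a metric defined by the HERBench paper.
    """
    chance = chance_baseline(num_options)
    return (accuracy - chance) / (1.0 - chance)


# Reported accuracy range on HERBench, five answer options per question.
low = chance_normalized(0.31, num_options=5)
high = chance_normalized(0.42, num_options=5)
print(f"chance-normalized range: {low:.4f} to {high:.4f}")
```

Under this rescaling, the reported 31-42% corresponds to recovering roughly 14% to 27.5% of the gap between random guessing and perfect accuracy.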

Abstract

Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, modestly above the 20% random-guess baseline. We disentangle this failure into two…
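The MRFS definition ("the smallest number of frames a model must fuse to answer correctly") can be illustrated with a brute-force search. This is only a toy sketch of the definition, not the paper's estimation procedure, which this page does not describe. The oracle `answers_correctly` and the helper `min_required_frames` are hypothetical names introduced for illustration; in practice the oracle would wrap a model evaluated on a subset of frames.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Optional, Sequence


def min_required_frames(
    frames: Sequence[int],
    answers_correctly: Callable[[FrozenSet[int]], bool],
) -> Optional[int]:
    """Return the size of the smallest frame subset that yields a correct answer.

    `answers_correctly` is a hypothetical oracle (an assumption of this sketch).
    Subsets are tried in order of increasing size, starting from the empty set,
    which covers questions answerable without any visual evidence.
    Exhaustive search is exponential, so this is only viable for toy inputs.
    """
    for k in range(len(frames) + 1):
        for subset in combinations(frames, k):
            if answers_correctly(frozenset(subset)):
                return k
    return None  # no subset of the given frames suffices


# Toy question whose answer needs cues from three distinct frames.
required_cues = {2, 7, 11}


def toy_oracle(subset: FrozenSet[int]) -> bool:
    # Correct only when every required cue frame is present.
    return required_cues <= subset


mrfs = min_required_frames(list(range(12)), toy_oracle)
print(mrfs)
```

In this toy setting the search returns 3, matching the benchmark's design goal that each question needs at least three non-overlapping cues. A question with a single-cue shortcut would return 1 under the same search.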

Code & Models

Repositories

danbenami/HERBench
github

Datasets

DanBenAmi/HERBench
dataset · 629 downloads