TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
Pengyu Yan, Akhil Gorugantu, Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, David Doermann

TL;DR
TRACE introduces a structured, evidence grounding-guided framework for multi-video event understanding, significantly improving factual accuracy and citation recall by combining OCR, object detection, and query-aware evidence localization.
Contribution
The paper proposes a novel ground-before-reasoning approach that enhances multi-video event reasoning by integrating structured evidence grounding with large vision-language models.
Findings
Improved macro-average MiRAGE F1 from 0.705 to 0.811 on MAGMaR 2026.
Enhanced citation recall from 0.440 to 0.628.
Achieved state-of-the-art results on the MAGMaR 2026 leaderboard.
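To put the reported numbers in perspective, the short sketch below computes the absolute and relative gains implied by the two before/after pairs listed above. The helper name `gains` and the dictionary layout are illustrative choices, not anything taken from the paper; only the four metric values come from the findings.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) for a metric that moved from before to after."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Metric values as reported in the findings; the labels are only for printing.
reported = {
    "macro-average MiRAGE F1": (0.705, 0.811),
    "citation recall": (0.440, 0.628),
}

for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

Under this arithmetic, the F1 gain is 0.106 absolute (about 15% relative) and the citation recall gain is 0.188 absolute (about 43% relative), so the larger proportional improvement is in citation recall.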
Abstract
Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are…
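The ground-before-reasoning idea described above can be illustrated with a minimal sketch: each video is reduced to a text-searchable timeline of OCR strings and detected object labels, and a text-only step picks the most query-relevant moments before any visual model is called. Everything here is an assumption made for illustration: the `TimelineEntry` fields, the `localize` function, the toy data, and in particular the keyword-overlap scoring, which merely stands in for the text-only LLM that the abstract describes. It is not the authors' implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    """One moment in a video timeline (hypothetical schema, not from the paper)."""
    video_id: str
    timestamp_s: float
    ocr_text: str        # text read from the frame, e.g. a scoreboard or subtitle
    objects: list[str]   # labels produced by an object detector

    def as_text(self) -> str:
        # Flatten OCR text and object labels into one searchable string.
        return f"{self.ocr_text} {' '.join(self.objects)}".lower()


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def localize(query: str, timeline: list[TimelineEntry], top_k: int = 2) -> list[TimelineEntry]:
    """Select up to top_k entries by token overlap with the query.

    Assumption: overlap counting is a simple stand-in for the text-only LLM
    that performs query-aware evidence localization in the described system.
    """
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(e.as_text())), e) for e in timeline]
    scored = [(score, e) for score, e in scored if score > 0]
    # Sort on the score only; the sort is stable, so ties keep timeline order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


# Toy timeline spanning two videos (made-up data for illustration only).
timeline = [
    TimelineEntry("vid_a", 12.0, "FINAL SCORE 3-1", ["scoreboard", "crowd"]),
    TimelineEntry("vid_a", 47.5, "", ["car", "street"]),
    TimelineEntry("vid_b", 5.0, "Breaking: final whistle", ["reporter", "microphone"]),
]

selected = localize("what was the final score", timeline)
for entry in selected:
    print(entry.video_id, entry.timestamp_s, entry.ocr_text)
```

In a fuller pipeline, only the frames at the selected timestamps, together with their text summaries, would be passed on to a vision-language model, which is what keeps the visual context budget small.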
