Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

Chinthani Sugandhika; Chen Li; Deepu Rajan; Basura Fernando

arXiv:2512.05513 · cs.CV · April 1, 2026

TL;DR

Know-Show is a benchmark for evaluating spatio-temporal grounded reasoning in Video-Language Models. It exposes gaps in current models, and the paper proposes a plug-in method intended to improve interpretability and reliability.

Contribution

The paper presents a new benchmark, Know-Show, for assessing grounded reasoning in video-language models and proposes GRAM, a training-free augmentation method to enhance reasoning capabilities.

Findings

01

Existing Video-LMs lag significantly behind humans on grounded reasoning.

02

The GRAM plug-in improves models' ability to reason about actions and their semantics.

03

Know-Show provides a unified framework for evaluating spatial and temporal grounding in videos.

Abstract

Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K high-quality human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with…
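The abstract describes scoring reasoning and localization together, but this page does not state which metrics the benchmark uses. As a rough illustration only, the sketch below uses intersection-over-union (IoU), a common localization measure, for both a temporal span and a spatial box, and counts a prediction as grounded only when the answer is right and the IoU clears a threshold. The function names, the `(start, end)` and `(x1, y1, x2, y2)` conventions, and the 0.5 threshold are assumptions for this example, not the paper's protocol.

```python
from typing import Tuple

# Assumed conventions (hypothetical, not taken from the paper):
Span = Tuple[float, float]               # (start, end) in seconds
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def temporal_iou(pred: Span, gt: Span) -> float:
    """IoU of two time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(answer_ok: bool, iou: float, threshold: float = 0.5) -> bool:
    """Count a prediction only if the answer is right AND the localization
    overlaps the ground truth enough (threshold is an assumed value)."""
    return answer_ok and iou >= threshold


if __name__ == "__main__":
    t = temporal_iou((0.0, 10.0), (5.0, 15.0))
    b = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
    print(f"temporal IoU = {t:.3f}, box IoU = {b:.3f}")
    print("grounded:", grounded_correct(True, t))
```

Requiring both conditions is one way to keep a model from scoring well by answering correctly while pointing at the wrong region or moment.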

Code & Models

Repositories

LUNAProject22/Know-Show
github
