GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

Yiyang Zhou; Linjie Li; Shi Qiu; Zhengyuan Yang; Yuyang Zhao; Siwei Han; Yangfan He; Kangqi Li; Haonian Ji; Zihao Zhao; Haibo Tong; Lijuan Wang; Huaxiu Yao

arXiv:2507.09491 · cs.CV · July 15, 2025



TL;DR

GLIMPSE is a new benchmark designed to evaluate whether large vision-language models can genuinely understand and reason with videos, revealing that current models still largely rely on superficial frame analysis.

Contribution

The paper introduces GLIMPSE, a comprehensive video understanding benchmark with carefully crafted questions that require full video reasoning, highlighting the gap in current LVLMs' video comprehension capabilities.

Findings

01

Human accuracy on GLIMPSE is 94.82%.

02

Even the best-performing LVLM evaluated, GPT-o3, reaches only 66.43%.

03

Models struggle to move beyond superficial frame analysis.

Abstract

Existing video benchmarks often resemble image-based benchmarks, with question types like "What actions does the person perform throughout the video?" or "What color is the woman's dress in the video?" For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are…


Code & Models

Datasets

AIMING-Lab-UNC/GLIMPSE
dataset · 2 downloads

Videos

GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?

Taxonomy

Topics: Multimodal Machine Learning Applications