Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

Xiaochen Wang; Heming Xia; Jialin Song; Longyu Guan; Yixin Yang; Qingxiu Dong; Weiyao Luo; Yifan Pu; Yiru Wang; Xiangdi Meng; Wenjie Li; Zhifang Sui

arXiv:2502.13925·cs.CL·October 10, 2025

TL;DR

This paper introduces StripCipher, a benchmark for evaluating how well Large Multimodal Models (LMMs) understand and reason over image sequences. It reveals a significant performance gap relative to humans, especially on the reordering task.

Contribution

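The paper presents StripCipher, a benchmark comprising a human-annotated dataset and three subtasks (visual narrative comprehension, contextual frame prediction, and temporal narrative reordering) for assessing how well LMMs understand sequential images. It also highlights the limitations of current models.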

Findings

01. GPT-4o achieves 23.93% accuracy in reordering images

02. Performance gap of over 50% between LMMs and humans

03. Input format significantly affects LMMs' sequential reasoning

Abstract

Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate the capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is…
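To put a reordering accuracy figure in context, the sketch below shows one plausible way such a score could be computed. This page does not state how the benchmark scores the reordering subtask, so exact-match scoring (a prediction counts only if the full frame order is correct) is an assumption here. The function names `reorder_accuracy` and `chance_level` and the toy data are hypothetical and not taken from the paper.

```python
from math import factorial
from typing import Sequence


def reorder_accuracy(predictions: Sequence[Sequence[int]],
                     references: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose predicted frame order matches the reference exactly.

    Assumption: exact-match scoring; the benchmark's actual metric may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return correct / len(references)


def chance_level(num_frames: int) -> float:
    """Probability that a uniformly random permutation of the frames is exactly right."""
    return 1.0 / factorial(num_frames)


# Hypothetical toy data (four 4-frame strips), not drawn from StripCipher.
refs = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
preds = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 1, 2, 3], [3, 2, 1, 0]]

toy_accuracy = reorder_accuracy(preds, refs)
print(f"{toy_accuracy:.2%}")     # 50.00%
print(f"{chance_level(4):.2%}")  # 4.17%
```

Under exact-match scoring, random guessing becomes rapidly less likely to succeed as the number of frames grows (1/n! for n frames), so a reported accuracy is best read against the sequence length used in the evaluation.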

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Focus