Re:Verse -- Can Your VLM Read a Manga?
Aaditya Baranwal, Madhav Kataria, Naitik Agrawal, Yogesh S Rawat, Shruti Vyas

TL;DR
This paper investigates the limitations of current Vision Language Models in understanding manga narratives, highlighting their struggles with temporal reasoning, character consistency, and story coherence across extended sequences.
Contribution
It introduces a novel evaluation framework combining multimodal annotation, cross-modal analysis, and retrieval methods to systematically assess narrative understanding in VLMs.
Findings
Current models excel at panel interpretation but fail at causal and temporal reasoning.
Applied to the Re:Zero manga, the framework reveals significant gaps in story-level comprehension.
The study provides actionable insights and a foundation for future evaluation of narrative intelligence.
Abstract
Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including…
