Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models
Yule Chen, Yufan Ren, Sabine Süsstrunk

TL;DR
This paper introduces a new benchmark and methods to improve vision-language models' understanding of comics, addressing their current limitations in recognizing complex visual narratives and densely packed multi-panel layouts.
Contribution
It presents the first comprehensive comic understanding benchmark, evaluates state-of-the-art models, and proposes Region-Aware Reinforcement Learning to enhance model focus on relevant regions.
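To make the zoom-in idea concrete, here is a minimal sketch of what a region-crop tool and a zoom-aware reward could look like. It is not the paper's implementation: the names `zoom_in` and `toy_reward`, the normalized box format, the nearest-neighbour upscaling, and the `zoom_bonus` value are all assumptions chosen for illustration. The only detail taken from the page is that the zoom-in tool receives its own reward during training.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image stored as rows of pixel values


def zoom_in(image: Image, box: Tuple[float, float, float, float], scale: int = 2) -> Image:
    """Crop a normalized (x0, y0, x1, y1) box and upscale it (hypothetical tool)."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("box must satisfy 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1")
    # Convert normalized coordinates to pixel indices, keeping at least one pixel.
    left, top = int(x0 * width), int(y0 * height)
    right = max(left + 1, int(x1 * width))
    bottom = max(top + 1, int(y1 * height))
    crop = [row[left:right] for row in image[top:bottom]]
    # Nearest-neighbour upscaling: repeat each pixel `scale` times along both axes.
    return [
        [pixel for pixel in row for _ in range(scale)]
        for row in crop
        for _ in range(scale)
    ]


def toy_reward(answer_correct: bool, used_zoom: bool, zoom_bonus: float = 0.1) -> float:
    """Assumed reward shape: correctness plus a small bonus for a zoom that helped."""
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and used_zoom:
        reward += zoom_bonus  # the bonus value is an illustrative assumption
    return reward


# A 4x4 toy "page"; zoom into its top-right quadrant.
page = [[r * 4 + c for c in range(4)] for r in range(4)]
panel = zoom_in(page, (0.5, 0.0, 1.0, 0.5))
```

Paying the bonus only when the answer is also correct is one way to discourage zooming for its own sake; the actual reward design may differ.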
Findings
State-of-the-art models perform poorly on comic tasks
Post-training strategies improve model performance
RARL significantly boosts entity recognition and storyline ordering
Abstract
Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. Beyond that, we evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial performance deficits across core tasks of our benchmarks and underscoring that comic understanding remains an unsolved challenge. To enhance VLMs' capabilities in this domain, we…
Peer Reviews
Decision · Submitted to ICLR 2026
- The comic domain is an interesting, niche task setting. The authors define 7 new sub-tasks, each targeting particular understanding capabilities. Existing models do not perform very well in this domain.
- They propose region-aware reinforcement learning to incentivize models to zoom in on images when needed. It improves performance significantly across different models.
- Multiple models are used in the experiments.
- Even though existing methods fall short on comic tasks, the paper does not convincingly explain why comic understanding is important or how it can benefit other tasks in the community.
- The technical contribution is limited: SFT and RL are not novel concepts. While zoom-in tools are a natural fit for comic books and small text, zoom-in is the only tool that has a special reward in the training. It is doubtful how many cases in the test split are relevant to the zoom-in tool usage and if there are other imp
The paper makes some contributions by addressing the overlooked challenge of comic understanding with a new fine-grained benchmark (AI4VA-FG) and a creative training approach, Region-Aware Reinforcement Learning (RARL). Originality: It tackles an underexplored yet challenging domain of comic understanding by introducing a fine-grained benchmark (AI4VA-FG) that surpasses existing visual narrative datasets. The proposed Region-Aware Reinforcement Learning (RARL) framework is also an inventive ada
1. The dataset construction pipeline is largely missing from this paper. It does not detail how the dataset is constructed from the AI4VA dataset, which I think is one of the key parts of this paper.
2. The range of comic styles is very limited: the data is sourced only from two mid-twentieth-century Franco-Belgian comics series, yet there are many other kinds of comics in the world. This is a major limitation for the applicability of this dataset.
3. The experiment also shows that
- The authors introduce a novel, fine-grained benchmark which demonstrates that existing state-of-the-art models have deficiencies in comic understanding tasks, highlighting clear real-world application value. The detailed composition of the benchmark is also provided.
- The authors propose a Region-Aware Reinforcement Learning framework that effectively addresses the aforementioned comic understanding challenges. This framework learns where and when to zoom in, in a manner akin to how humans process complex visu
- It’s understandable that the authors only tested their RARL framework on Qwen2.5-VL models due to economic and computational constraints. However, this still needs more analysis and explanation. The paper should provide more detailed computational reports and visualizations of the training process to better substantiate the strategy's generality.
- The paper lacks a numerical analysis of the reward components during training (e.g., their fluctuation range or trends), and provides no correspond
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
