Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models

Yule Chen; Yufan Ren; Sabine Süsstrunk

arXiv:2511.06490·cs.CV·November 11, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a new benchmark and methods to improve vision-language models' understanding of comics, addressing their current limitations in recognizing complex visual narratives and densely packed multi-panel layouts.

Contribution

It presents the first comprehensive comic understanding benchmark, evaluates state-of-the-art models, and proposes Region-Aware Reinforcement Learning to enhance model focus on relevant regions.

Findings

1. State-of-the-art models perform poorly on comic tasks
2. Post-training strategies improve model performance
3. RARL significantly boosts entity recognition and storyline ordering

Abstract

Complex visual narratives, such as comics, present a significant challenge to Vision-Language Models (VLMs). Despite excelling on natural images, VLMs often struggle with stylized line art, onomatopoeia, and densely packed multi-panel layouts. To address this gap, we introduce AI4VA-FG, the first fine-grained and comprehensive benchmark for VLM-based comic understanding. It spans tasks from foundational recognition and detection to high-level character reasoning and narrative construction, supported by dense annotations for characters, poses, and depth. Beyond that, we evaluate state-of-the-art proprietary models, including GPT-4o and Gemini-2.5, and open-source models such as Qwen2.5-VL, revealing substantial performance deficits across core tasks of our benchmarks and underscoring that comic understanding remains an unsolved challenge. To enhance VLMs' capabilities in this domain, we…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The task in the comic domain is interesting and niche. They categorize 7 new sub-tasks, and each of them features some understanding capabilities. Existing models do not perform very well in this domain.
- They propose a region-aware reinforcement learning to incentivize models to zoom in on images when needed. It helps the performance significantly across different models.
- Multiple models are used in the experiments.

Weaknesses

- Even though existing methods fall short in comic tasks, it is not very convincing why the comic task is important and can benefit other tasks in the community.
- They do not have much technical contribution. The SFT and RL are not novel concepts. While for comic books and small texts, it is natural to use zoom-in tools, it is the only tool that has a special reward in the training. It is doubtful how many cases in the test split are relevant to the zoom-in tool usage and if there are other imp

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper makes some contributions by addressing the overlooked challenge of comic understanding with a new fine-grained benchmark (AI4VA-FG) and a creative training approach, Region-Aware Reinforcement Learning (RARL).

Originality: It tackles an underexplored yet challenging domain of comic understanding by introducing a fine-grained benchmark (AI4VA-FG) that surpasses existing visual narrative datasets. The proposed Region-Aware Reinforcement Learning (RARL) framework is also an inventive ada

Weaknesses

1. I think the dataset construction pipeline is somewhat missing in this paper. It does not show the details of how this dataset is constructed from the AI4VA dataset. I think this is one of the key parts of this paper.
2. The style of the comics is very limited. It is only sourced from two mid-twentieth-century Franco-Belgian comic series. However, there are many other kinds of comics in the world. This is a huge limitation for the application of this dataset.
3. The experiment also shows that

Reviewer 03 · Rating 2 · Confidence 5

Strengths

- The authors introduce a novel, fine-grained benchmark which demonstrates that existing state-of-the-art models have deficiencies in comic understanding tasks, highlighting a clear real-world application value. Detailed composition of the benchmark is also provided.
- Propose a Region-Aware Reinforcement Learning framework that effectively addresses the aforementioned comic understanding challenges. This framework learns where and when to zoom in, in a manner akin to how humans process complex visu

Weaknesses

- It’s understandable that the authors only tested their RARL framework on Qwen2.5-VL models due to economic and computational constraints. However, this still needs more analysis and explanation. The paper should provide more detailed computational reports and visualizations of the training process to better substantiate the strategy's generality.
- The paper lacks a numerical analysis of the reward components during training (e.g., their fluctuation range or trends), and provides no correspond

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling