MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

Jeonghun Baek; Kazuki Egashira; Shota Onohara; Atsuyuki Miyai; Yuki Imajuku; Hikaru Ikuta; Kiyoharu Aizawa

arXiv:2505.20298 · cs.CL · January 27, 2026

TL;DR

This paper introduces two benchmarks for multimodal manga understanding, MangaOCR (in-page text recognition) and MangaVQA (contextual understanding through visual question answering), together with MangaLMM, a manga-specialized model finetuned from Qwen2.5-VL to handle both tasks.

Contribution

It contributes the MangaOCR and MangaVQA benchmarks, the latter built from 526 manually constructed question-answer pairs, and MangaLMM, a model finetuned from the open-source LMM Qwen2.5-VL to jointly handle text recognition and visual question answering on manga.

Findings

01

MangaVQA evaluates contextual understanding in manga across diverse narrative and visual scenarios.

02

MangaLMM outperforms baseline models on manga understanding tasks.

03

Benchmark results highlight strengths and limitations of current LMMs.

Abstract

Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such…

Code & Models

Repositories

manga109/mangalmm (PyTorch, official)

Videos

MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques