MangaUB: A Manga Understanding Benchmark for Large Multimodal Models

Hikaru Ikuta; Leslie Wöhler; Kiyoharu Aizawa

arXiv:2407.19034 · cs.CV · October 23, 2024

TL;DR

MangaUB is a new benchmark designed to evaluate large multimodal models' ability to understand manga, highlighting strengths in image recognition and challenges in cross-panel comprehension.

Contribution

The paper introduces MangaUB, a comprehensive benchmark for assessing LMMs' manga understanding capabilities across single and multiple panels.

Findings

01 Strong performance in image content recognition

02 Challenges in understanding emotions across panels

03 Identifies areas for future model improvements

Abstract

Manga is a popular medium that combines stylized drawings and text to convey stories. As manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) shows possibilities for more general approaches. To provide an analysis of the current capability of LMMs for manga understanding tasks and to identify areas for their improvement, we design and evaluate MangaUB, a novel manga understanding benchmark for LMMs. MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels, allowing for a fine-grained analysis of a model's various capabilities required for manga understanding. Our results show strong performance on the recognition of image content, while understanding the emotion and…

Taxonomy

Topics: Web Data Mining and Analysis · Speech and dialogue systems · Natural Language Processing Techniques