CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books

Marc Serra Ortega; Emanuele Vivoli; Artemis Llabrés; Dimosthenis Karatzas

arXiv:2507.10053·cs.CV·July 15, 2025

TL;DR

CoSMo is a multimodal Transformer for page stream segmentation in comic books. Its vision-only and multimodal variants outperform traditional baselines and significantly larger general-purpose vision-language models on F1-Macro, Panoptic Quality, and stream-level metrics.

Contribution

The paper formalizes page stream segmentation for comic books, introduces CoSMo in vision-only and multimodal variants, curates a new 20,800-page annotated dataset, and evaluates the model against traditional baselines and larger general-purpose vision-language models.

Findings

1. Visual features dominate macro-structure segmentation.

2. Multimodal approach improves ambiguity resolution.

3. CoSMo achieves state-of-the-art performance.

Abstract

This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.
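
To make the task and one of the named metrics concrete, the following is a minimal sketch that treats page stream segmentation as one label per page, groups consecutive pages into segments, and scores a prediction with a Panoptic Quality style measure. It rests on several assumptions that do not come from the abstract: the label names (`cover`, `story`, `advertisement`) are hypothetical, the 1D adaptation of Panoptic Quality (match segments of the same label when their page-range IoU exceeds 0.5) is a common formulation rather than the paper's confirmed definition, and adjacent segments that share a label are merged, which a real annotation scheme might keep separate.

```python
from itertools import groupby


def labels_to_segments(labels):
    """Group consecutive identical page labels into (label, start, end) runs.

    `end` is exclusive. Adjacent runs with the same label are merged,
    which is a simplifying assumption of this sketch.
    """
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    return segments


def segment_iou(a, b):
    """Intersection over union of two page ranges."""
    inter = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    union = (a[2] - a[1]) + (b[2] - b[1]) - inter
    return inter / union if union else 0.0


def panoptic_quality(pred_labels, true_labels):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN).

    A predicted and a true segment match when they share a label and
    their IoU is strictly greater than 0.5, which makes matches unique.
    """
    pred = labels_to_segments(pred_labels)
    true = labels_to_segments(true_labels)
    matched_ious = []
    used_pred = set()
    for t in true:
        for i, p in enumerate(pred):
            if i in used_pred or p[0] != t[0]:
                continue
            score = segment_iou(p, t)
            if score > 0.5:
                matched_ious.append(score)
                used_pred.add(i)
                break
    tp = len(matched_ious)
    fp = len(pred) - tp  # predicted segments with no match
    fn = len(true) - tp  # true segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


# Hypothetical 7-page issue: the prediction misplaces one boundary.
true_pages = ["cover", "story", "story", "story",
              "advertisement", "story", "story"]
pred_pages = ["cover", "story", "story", "advertisement",
              "advertisement", "story", "story"]
pq = panoptic_quality(pred_pages, true_pages)
```

A single misplaced boundary affects two segments at once: here the first story segment still matches with a reduced IoU, while the advertisement segment falls to an IoU of exactly 0.5 and counts as both a false positive and a false negative. This sensitivity is one reason segment-level metrics are reported alongside page-level ones such as F1-Macro.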

Taxonomy

Topics: Video Analysis and Summarization · Comics and Graphic Narratives · Artificial Intelligence in Games