ComiCap: A VLMs pipeline for dense captioning of Comic Panels

Emanuele Vivoli; Niccolò Biondi; Marco Bertini; Dimosthenis Karatzas

arXiv:2409.16159 · cs.CV · September 25, 2024

TL;DR

This paper introduces ComiCap, a pipeline that uses Vision-Language Models (VLMs) to generate dense, grounded captions for comic panels. Without any additional training, it outperforms specialized models in caption quality and enables large-scale annotation of comic books.

Contribution

It presents a novel VLM-based pipeline for dense captioning of comic panels, including an attribute-retaining metric and a new annotated dataset for evaluation.
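As a rough illustration of what an attribute-retaining metric could look like, the sketch below scores a caption by the fraction of reference attributes it mentions. This is not the paper's definition, which is not given on this page: the function name `attribute_retention`, the case-insensitive substring matching, and the example caption and attributes are all assumptions made for illustration.

```python
def attribute_retention(caption: str, attributes: list[str]) -> float:
    """Fraction of reference attributes found in the caption.

    Hypothetical stand-in: matching is a case-insensitive substring test,
    which is an assumption and not the method used in the paper.
    """
    if not attributes:
        # Nothing to retain, so treat the caption as fully retaining.
        return 1.0
    text = caption.lower()
    found = sum(1 for attr in attributes if attr.lower() in text)
    return found / len(attributes)


# Made-up example: two of the three attributes appear in the caption.
score = attribute_retention(
    "A woman in a red coat holds an umbrella on a rainy street.",
    ["red coat", "umbrella", "dog"],
)
print(f"{score:.2f}")
```

A real metric would likely need more robust matching (synonyms, paraphrases) than a substring test; the sketch only shows the retained-over-total shape of the score.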

Findings

01 The pipeline outperforms specialized models in caption quality.

02 It can annotate over 2 million comic panels efficiently.

03 The method requires no additional training of VLMs.

Abstract

The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training