ComiCap: A VLMs pipeline for dense captioning of Comic Panels
Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

TL;DR
This paper introduces ComiCap, a pipeline that uses Vision-Language Models (VLMs) to generate dense, grounded captions for comic panels. It requires no additional training, produces higher-quality captions than specialized models, and enables large-scale annotation of comic books.
Contribution
It presents a novel VLM-based pipeline for dense captioning of comic panels, including an attribute-retaining metric and a new annotated dataset for evaluation.
Findings
The pipeline outperforms specialized models in caption quality.
It can annotate over 2 million comic panels efficiently.
The method requires no additional training of VLMs.
Abstract
The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test…
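The abstract describes the attribute-retaining metric only at a high level: it checks whether the important attributes of a panel appear in the generated caption. As a rough illustration of that idea, the following is a minimal sketch, not the paper's definition. The function name `attribute_retention`, the case-insensitive substring matching, and the example attributes are all assumptions made for illustration; the actual metric may match attributes quite differently.

```python
def attribute_retention(reference_attributes, caption):
    """Return the fraction of reference attributes found in the caption.

    Hypothetical helper: matching is a plain case-insensitive substring
    test, which is an assumption and not the method from the paper.
    """
    if not reference_attributes:
        # With nothing to retain, treat the caption as fully retaining.
        return 1.0
    caption_lower = caption.lower()
    retained = [a for a in reference_attributes if a.lower() in caption_lower]
    return len(retained) / len(reference_attributes)


# Illustrative (made-up) panel attributes and caption.
attributes = ["red cape", "rooftop", "night"]
caption = "A hero in a red cape stands on a rooftop at dusk."
score = attribute_retention(attributes, caption)
print(f"retained {score:.2f} of the reference attributes")
```

A score like this could be used to compare captions from different models on the same panel, with the caveat that exact substring matching misses paraphrases such as "dusk" for "night".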
Taxonomy
Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
