Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
Yingxuan Li, Ryota Hinami, Kiyoharu Aizawa, Yusuke Matsui

TL;DR
This paper introduces a zero-shot multimodal framework for identifying characters and predicting speakers in comics without requiring annotations, leveraging large language models and iterative fusion techniques.
Contribution
It presents the first multimodal, zero-shot approach for character and speaker recognition in comics, eliminating the need for comic-specific training data.
Findings
Effective zero-shot character identification and speaker prediction
Establishes a robust baseline for multimodal comic analysis
Applicable to any comic series without additional training
Abstract
Recognizing characters and predicting the speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches such as training character classifiers, which require annotations specific to each comic title, are infeasible. This motivates us to propose a novel zero-shot approach that allows machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have remained largely unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, but their application to multimodal content analysis remains an open problem. To address this problem, we propose an iterative…
Taxonomy
Topics: Comics and Graphic Narratives · Translation Studies and Practices · Handwritten Text Recognition Techniques
