Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion

Yingxuan Li; Ryota Hinami; Kiyoharu Aizawa; Yusuke Matsui

arXiv:2404.13993 · cs.MM · September 6, 2024 · 1 citation

TL;DR

This paper introduces a zero-shot multimodal framework for identifying characters and predicting speakers in comics without requiring annotations, leveraging large language models and iterative fusion techniques.

Contribution

It presents the first multimodal, zero-shot approach for character and speaker recognition in comics, eliminating the need for comic-specific training data.

Findings

01

Effective zero-shot character identification and speaker prediction

02

Establishes a robust baseline for multimodal comic analysis

03

Applicable to any comic series without additional training

Abstract

Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches, such as training character classifiers, which require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative…

Code & Models

Repositories

liyingxuan1012/zeroshot-speaker-prediction
PyTorch · Official

Taxonomy

Topics

Comics and Graphic Narratives · Translation Studies and Practices · Handwritten Text Recognition Techniques