Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment
Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie Dorr

TL;DR
This paper evaluates vision-language models for comic understanding, revealing that semantic similarity metrics are unreliable due to hallucinations, and highlights the need for improved data and hallucination mitigation.
Contribution
It introduces a preliminary benchmark for comic interpretation with a focus on hallucinations, providing a taxonomy of the hallucinations observed and guidance for future research in this domain.
Findings
Hallucinations significantly affect model performance.
Semantic similarity is a spurious measure for comic understanding.
Future research should prioritize hallucination mitigation and improved data curation for comic interpretation.
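To make the second finding concrete, here is a minimal sketch of how a similarity score can reward a hallucinated description. It is an illustration only, not the paper's evaluation code: it uses a bag-of-words cosine similarity as a simple stand-in for an embedding-based semantic similarity metric, and the three example sentences are invented for this sketch.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: this lexical measure is only a stand-in for an
    embedding-based semantic similarity metric, not the paper's metric.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical descriptions of one comic panel (invented for illustration).
reference = "a boy hands a letter to a girl on a bridge"
# Same wording, but the key object is wrong (a hallucinated sword).
hallucinated = "a boy hands a sword to a girl on a bridge"
# Factually faithful, but worded differently.
paraphrase = "on the bridge the young man gives the girl a note"

score_hallucinated = cosine_similarity(reference, hallucinated)
score_paraphrase = cosine_similarity(reference, paraphrase)

print(f"hallucinated: {score_hallucinated:.3f}")
print(f"paraphrase:   {score_paraphrase:.3f}")
```

In this toy case the description containing the wrong object scores far higher than the faithful paraphrase, which is the kind of mismatch between score and correctness that the finding describes. Embedding-based metrics behave differently in detail, so this should be read as intuition rather than evidence.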
Abstract
A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.
Taxonomy
Topics: Comics and Graphic Narratives · Multimodal Machine Learning Applications · Artificial Intelligence in Games
