Towards Faithful Reasoning in Comics for Small MLLMs
Chengcheng Feng, Haojie Yin, Yucheng Jin, Kaizhu Huang

TL;DR
This paper introduces a two-stage framework comprising MoCoT and VERA that improves the faithfulness of reasoning in small Multimodal Large Language Models (MLLMs) on comic understanding by preserving multi-cue interpretation.
Contribution
It proposes a novel supervision-construction method and a reward mechanism that improve reasoning faithfulness and performance in small MLLMs on comic and visual reasoning tasks.
Findings
Achieves strong results on five comic understanding benchmarks.
Surpasses several 7B baseline models.
Improves four small MLLMs by an average of 12.1%.
Abstract
Comic understanding presents a significant challenge for Multimodal Large Language Models (MLLMs), as the intended meaning of a comic often emerges from the joint interpretation of visual, textual, and social cues. This naturally motivates Chain-of-Thought (CoT) prompting, since explicit intermediate reasoning appears promising for integrating such heterogeneous signals. However, existing CoT methods are poorly matched to this structure: they tend to force interpretation into a single reasoning path before multiple cues have been jointly considered, often degrading performance, especially for small MLLMs. Our key idea is to explicitly preserve multi-cue interpretation during supervision construction, rather than collapsing comic understanding into a single reasoning chain. To this end, we propose a two-stage framework for faithful comic reasoning in small MLLMs. First, we introduce…
