Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA
Tong Wu, Thanet Markchom

TL;DR
This paper introduces a multi-agent LLM framework for cartoon VQA, addressing the unique challenges of interpreting stylised visuals and narrative context, and evaluates its effectiveness on two cartoon datasets.
Contribution
It proposes a novel multi-agent architecture with visual, language, and critic agents tailored for cartoon VQA, enhancing structured reasoning in stylised imagery.
Findings
The contribution of each agent to the final prediction is analysed.
The framework improves understanding of LLM multi-agent behaviour in cartoon VQA.
Experimental results demonstrate the framework's effectiveness on the Pororo and Simpsons datasets.
Abstract
Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents (a visual agent, a language agent and a critic agent) that work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA…
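The abstract names three agents but this summary does not say how they exchange information. The sketch below is one possible reading, not the authors' implementation: it assumes a simple sequential flow (visual, then language, then critic). The names `Agent`, `CartoonQuestion` and `run_pipeline`, the prompt wording, and the idea of passing each agent as a plain prompt-to-text callable are all hypothetical. Every intermediate output is returned so that each agent's contribution to the final prediction can be inspected.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interface: any callable that maps a prompt string to a reply string.
Agent = Callable[[str], str]


@dataclass
class CartoonQuestion:
    scene: str      # stand-in for the cartoon frame (e.g. a caption or image reference)
    question: str


def run_pipeline(item: CartoonQuestion, visual: Agent, language: Agent, critic: Agent) -> Dict[str, str]:
    """Run the three agents in sequence and keep every intermediate output."""
    # Assumed step 1: the visual agent extracts visual cues from the frame.
    visual_notes = visual(f"Scene: {item.scene}\nDescribe the cues relevant to: {item.question}")
    # Assumed step 2: the language agent drafts an answer from those cues.
    draft = language(f"Visual cues: {visual_notes}\nQuestion: {item.question}\nAnswer:")
    # Assumed step 3: the critic agent checks the draft and returns the final answer.
    final = critic(
        f"Question: {item.question}\nVisual cues: {visual_notes}\nDraft answer: {draft}\n"
        "Return the draft if it is consistent with the cues, otherwise a corrected answer."
    )
    return {"visual": visual_notes, "draft": draft, "final": final}
```

In practice each callable would wrap a model call; comparing `draft` with `final` across a dataset is one way to see how often the critic changes the prediction.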
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
