Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA

Tong Wu; Thanet Markchom

arXiv:2601.03073·cs.CV·January 7, 2026



TL;DR

This paper introduces a multi-agent LLM framework for cartoon VQA, addressing the unique challenges of interpreting stylised visuals and narrative context, and evaluates its effectiveness on two cartoon datasets.

Contribution

The paper proposes a multi-agent architecture with visual, language, and critic agents tailored to cartoon VQA, intended to support structured reasoning over stylised imagery.

Findings

01. The contribution of each agent to the final prediction is analysed.

02. The framework improves understanding of LLM-based multi-agent behaviour in cartoon VQA.

03. Experimental results demonstrate the framework's effectiveness on the Pororo and Simpsons datasets.

Abstract

Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, which are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks in cartoon imagery. The proposed architecture consists of three specialised agents, a visual agent, a language agent and a critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA…
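
The three-agent structure described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the agents are wired together, so the visual → language → critic order, the multiple-choice answer format, and every name here (`VQAExample`, `run_pipeline`, the `stub_*` functions) are hypothetical. The stubs stand in for real model calls and only show how intermediate outputs could be kept so that each agent's contribution can be inspected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VQAExample:
    image_id: str          # stand-in for a cartoon frame (assumption)
    question: str
    candidates: List[str]  # assumed multiple-choice format


# Each agent is a plain callable so real model calls could be swapped in.
VisualAgent = Callable[[VQAExample], str]
LanguageAgent = Callable[[VQAExample, str], str]
CriticAgent = Callable[[VQAExample, str, str], str]


def run_pipeline(
    example: VQAExample,
    visual: VisualAgent,
    language: LanguageAgent,
    critic: CriticAgent,
) -> Dict[str, str]:
    """Run the agents in an assumed order and keep every intermediate output."""
    description = visual(example)                # visual cues as text
    draft = language(example, description)       # draft answer from the cues
    final = critic(example, description, draft)  # accept or revise the draft
    return {"description": description, "draft": draft, "final": final}


# --- Hypothetical stubs: placeholders, not real models ---

def stub_visual(example: VQAExample) -> str:
    return f"Frame {example.image_id}: a character is waving at the camera."


def stub_language(example: VQAExample, description: str) -> str:
    # Pick the first candidate mentioned in the description, else the first one.
    for candidate in example.candidates:
        if candidate.lower() in description.lower():
            return candidate
    return example.candidates[0]


def stub_critic(example: VQAExample, description: str, draft: str) -> str:
    # Accept the draft only if it is a valid candidate; otherwise fall back.
    return draft if draft in example.candidates else example.candidates[0]


example = VQAExample(
    image_id="frame_001",
    question="What is the character doing?",
    candidates=["sleeping", "waving", "eating"],
)
result = run_pipeline(example, stub_visual, stub_language, stub_critic)
print(result)
```

Returning the description, draft, and final answer together is a design choice made for this sketch: comparing `draft` with `final` shows when the critic changed the answer, which is one simple way to look at per-agent contribution.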


Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis