ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
Aymen Lassoued, Mohamed Ali Souibgui, Yousri Kessentini

TL;DR
ORCA introduces a multi-agent system for document VQA that decomposes complex questions, coordinates specialized agents, and employs iterative refinement to improve reasoning accuracy and reliability.
Contribution
The paper presents a novel multi-agent framework with strategic coordination and iterative refinement for improved document visual question answering.
Findings
Significant performance improvements over state-of-the-art methods
Effective decomposition of complex questions into manageable sub-tasks
Robust answer validation through debate and sanity checks
Abstract
Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and…
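The decompose-then-route flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the modality labels ("text", "table"), the keyword-based decomposition, and all names (Step, AGENT_DOCK, decompose, route, answer) are assumptions made for the example. A real system would presumably call a VLM or LLM where the stubs below return fixed strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    modality: str     # hypothetical label for the document element this step targets
    instruction: str  # what the specialized agent is asked to do


# An agent takes (document, instruction) and returns its output as text.
Agent = Callable[[str, str], str]


def text_agent(document: str, instruction: str) -> str:
    # Stub: a real agent would run a model over the document's text regions.
    return f"[text] {instruction}"


def table_agent(document: str, instruction: str) -> str:
    # Stub: a real agent would parse and query tables in the document.
    return f"[table] {instruction}"


# Hypothetical "agent dock": maps a modality label to a specialized agent.
AGENT_DOCK: Dict[str, Agent] = {"text": text_agent, "table": table_agent}


def decompose(question: str) -> List[Step]:
    """Stand-in for the reasoning agent: splits a question into ordered steps.

    Assumption: a keyword check replaces the model call a real system would make.
    """
    steps: List[Step] = []
    if "table" in question.lower():
        steps.append(Step("table", "locate the relevant table cell"))
    steps.append(Step("text", "read the surrounding text"))
    return steps


def route(steps: List[Step], document: str) -> List[str]:
    """Routing mechanism: activate the agent registered for each step's modality."""
    outputs: List[str] = []
    for step in steps:
        agent = AGENT_DOCK.get(step.modality, text_agent)  # fall back to the text agent
        outputs.append(agent(document, step.instruction))
    return outputs


def answer(question: str, document: str) -> List[str]:
    # Decompose first, then dispatch each step to its specialized agent.
    return route(decompose(question), document)
```

Keeping the dock as a plain dictionary means a new specialized agent can be registered without touching the routing logic. The iterative refinement and answer validation stages mentioned elsewhere in this summary are not modeled here.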
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
