ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

Aymen Lassoued; Mohamed Ali Souibgui; Yousri Kessentini

arXiv:2603.02438 · cs.CV · March 4, 2026

Open Access

TL;DR

ORCA introduces a multi-agent system for document VQA that decomposes complex questions, coordinates specialized agents, and employs iterative refinement to improve reasoning accuracy and reliability.

Contribution

The paper presents a novel multi-agent framework with strategic coordination and iterative refinement for improved document visual question answering.

Findings

1. Significant performance improvements over state-of-the-art methods
2. Effective decomposition of complex questions into manageable sub-tasks
3. Robust answer validation through debate and sanity checks

Abstract

Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and…
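The decompose-then-route flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `AGENT_DOCK`, `decompose`, `route`, and `answer` are hypothetical, the two modalities ("text" and "table") are assumed for illustration, and a keyword rule stands in for the model-based reasoning agent. The iterative refinement stage is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One logical step produced by the (hypothetical) reasoning agent."""
    modality: str      # which kind of document element the step targets
    instruction: str   # what the specialized agent is asked to do


# Hypothetical specialized agents; real ones would call a model or a tool.
def text_agent(instruction: str) -> str:
    return f"[text] {instruction}"


def table_agent(instruction: str) -> str:
    return f"[table] {instruction}"


# Hypothetical "agent dock": modality name -> agent callable.
AGENT_DOCK: Dict[str, Callable[[str], str]] = {
    "text": text_agent,
    "table": table_agent,
}


def decompose(question: str) -> List[Step]:
    """Stand-in for the reasoning agent: a keyword rule, not a model."""
    steps: List[Step] = []
    if "table" in question.lower():
        steps.append(Step("table", "locate the relevant table cell"))
    steps.append(Step("text", "read the surrounding text"))
    return steps


def route(steps: List[Step]) -> List[str]:
    """Activate only the agents that the decomposed steps call for."""
    outputs: List[str] = []
    for step in steps:
        agent = AGENT_DOCK.get(step.modality)
        if agent is None:
            continue  # no agent registered for this modality; skip the step
        outputs.append(agent(step.instruction))
    return outputs


def answer(question: str) -> List[str]:
    """Decompose the question, then route each step to its agent."""
    return route(decompose(question))
```

Calling `answer("What is the total in the table?")` activates both placeholder agents, while a question that does not mention a table reaches only the text agent. Keeping the dock as a plain dictionary means a new modality can be added without touching the routing logic.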

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques