Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

Xinlei Yu; Chengming Xu; Zhangquan Chen; Yudong Zhang; Shilin Lu; Cheng Yang; Jiangning Zhang; Shuicheng Yan; Xiaobin Hu

arXiv:2508.03404·cs.CV·November 17, 2025

Open Access

TL;DR

This paper introduces MACT, a multi-agent framework for visual document understanding that dynamically allocates computational resources through adaptive test-time scaling, improving reasoning accuracy and efficiency.

Contribution

It pioneers a procedural scaling paradigm with a multi-agent architecture and adaptive resource allocation for visual document reasoning tasks.

Findings

01

Achieves top-three performance on multiple benchmarks.

02

Improves accuracy by 9.9-11.5% over base models.

03

Uses fewer parameters while maintaining reasoning capabilities.

Abstract

The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual document understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding. This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that…
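The four-agent flow described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function `run_pipeline`, the `Trace` record, the `max_rounds` cap, and all four `toy_*` agents are hypothetical stand-ins, and the routing of judge feedback back to the planner is one plausible reading of the "self-correction loop". The agent-wise adaptive test-time scaling strategy is left out because the abstract is truncated before it is described.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """Hypothetical record of what happened during one pipeline run."""
    plan: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)
    rounds: int = 0


def run_pipeline(question, planner, executor, judge, answerer, max_rounds=3):
    """Plan, execute, judge, and retry on rejection; then produce an answer.

    Assumption: the judge returns (accepted, feedback) and the feedback is
    passed to the planner on the next round. The source does not confirm
    this routing.
    """
    trace = Trace()
    feedback = ""
    for _ in range(max_rounds):
        trace.rounds += 1
        trace.plan = planner(question, feedback)
        trace.results = [executor(step) for step in trace.plan]
        accepted, feedback = judge(question, trace.plan, trace.results)
        if accepted:
            break
    return answerer(question, trace.results), trace


# Toy agents: plain functions standing in for model calls.
def toy_planner(question, feedback):
    steps = ["locate the relevant region"]
    if feedback:
        # A rejected round adds a corrective step.
        steps.append("re-read the values flagged by the judge")
    return steps


def toy_executor(step):
    return f"done: {step}"


def toy_judge(question, plan, results):
    # Accept only when at least two pieces of evidence are available.
    if len(results) < 2:
        return False, "evidence is too thin"
    return True, ""


def toy_answerer(question, results):
    return f"answer from {len(results)} verified step(s)"
```

With these toy agents, the first round is rejected by the judge, the planner adds a corrective step, and the second round is accepted. In a real system each function would wrap a model call, and the retry budget would presumably be set per agent rather than by a single shared cap.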

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Handwritten Text Recognition Techniques