Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning

Ye Mo; Zirui Shao; Kai Ye; Xianwei Mao; Bo Zhang; Hangdi Xing; Peng Ye; Gang Huang; Kehan Chen; Zhou Huan; Zixu Yan; Sheng Zhou

arXiv:2505.18603·cs.AI·May 27, 2025



TL;DR

Doc-CoB introduces a visual reasoning mechanism that lets multimodal large language models (MLLMs) select and focus on the document regions relevant to a query, improving document understanding by mimicking the human coarse-to-fine reading pattern.

Contribution

It presents a novel visual reasoning approach that enhances MLLMs' focus on critical regions without altering their architecture, using a new training pipeline and auxiliary tasks.

Findings

01. Significant performance improvements on seven benchmarks.

02. Effective integration with multiple MLLM architectures.

03. Enhanced focus on relevant document regions improves accuracy.

Abstract

Multimodal large language models (MLLMs) have made significant progress in document understanding. However, the information-dense nature of document images still poses challenges, as most queries depend on only a few relevant regions, with the rest being redundant. Existing one-pass MLLMs process entire document images without considering query relevance, often failing to focus on critical regions and producing unfaithful responses. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB (Chain-of-Box), a simple yet effective mechanism that integrates human-style visual reasoning into MLLMs without modifying their architecture. Our method allows the model to autonomously select the set of regions (boxes) most relevant to the query, and then focus attention on them for further understanding. We first design a fully automatic pipeline, integrating a commercial MLLM with a…
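The select-then-focus idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: in Doc-CoB the MLLM itself chooses the boxes, whereas here the selection and answering steps are stand-in callables. The names `Box`, `crop`, `select_boxes`, `chain_of_boxes`, `score_fn`, `answer_fn`, and the relevance `threshold` are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Toy grayscale image: a list of rows, each row a list of pixel values.
Image = List[List[int]]


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (right and bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by `box`."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def select_boxes(candidates: Sequence[Box],
                 scores: Sequence[float],
                 threshold: float) -> List[Box]:
    """Coarse step: keep candidates whose relevance score meets the threshold."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return [box for box, score in zip(candidates, scores) if score >= threshold]


def chain_of_boxes(image: Image,
                   candidates: Sequence[Box],
                   score_fn: Callable[[Box], float],
                   answer_fn: Callable[[List[Image]], Any],
                   threshold: float = 0.5) -> Tuple[List[Box], Any]:
    """Select query-relevant boxes, then answer from the cropped regions only.

    `score_fn` and `answer_fn` are hypothetical stand-ins for the model's
    box selection and final answering steps (assumption, not from the paper).
    """
    scores = [score_fn(box) for box in candidates]
    boxes = select_boxes(candidates, scores, threshold)
    regions = [crop(image, box) for box in boxes]  # fine step: focus on regions
    return boxes, answer_fn(regions)


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    candidates = [Box(0, 0, 2, 2), Box(2, 2, 4, 4)]
    boxes, answer = chain_of_boxes(
        image,
        candidates,
        score_fn=lambda b: 0.9 if b.left == 0 else 0.1,  # pretend relevance
        answer_fn=lambda regions: sum(v for reg in regions for row in reg for v in row),
    )
    print(boxes, answer)
```

The sketch separates the coarse step (choosing boxes) from the fine step (reading only the cropped regions), which is the part of the mechanism the abstract describes; everything about how relevance is actually scored is left to the stand-in callables.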


Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling

Methods: Softmax · Attention Is All You Need · Focus · Sparse Evolutionary Training