Focus Anywhere for Fine-grained Multi-page Document Understanding

Chenglong Liu; Haoran Wei; Jinyue Chen; Lingyu Kong; Zheng Ge; Zining Zhu; Liang Zhao; Jianjian Sun; Chunrui Han; Xiangyu Zhang

arXiv:2405.14295·cs.CV·May 24, 2024·2 cites

TL;DR

This paper introduces Fox, a novel approach that enhances large vision-language models to perform fine-grained, multi-page document understanding by focusing attention on specific regions and integrating multiple visual vocabularies.

Contribution

The paper presents a new pipeline, hybrid data, and a tuning strategy that together significantly improve LVLMs' ability to understand multi-page documents at a fine-grained level without modifying model weights.

Findings

1. Fox outperforms existing models on fine-grained document tasks.
2. The approach enables focus anywhere in multi-page documents.
3. A new benchmark with 9 sub-tasks promotes further research.

Abstract

Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a novel task to boost the document understanding by making LVLMs focus attention on the document-level region, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the…

Code & Models

Repositories

Ucas-HaoranWei/Vary
pytorch

Datasets

meituan-longcat/UNO-Bench
dataset · 2.0k dl

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Digital Humanities and Scholarship

Methods: Focus