Focus Anywhere for Fine-grained Multi-page Document Understanding
Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

TL;DR
This paper introduces Fox, a novel approach that enhances large vision-language models to perform fine-grained, multi-page document understanding by focusing attention on specific regions and integrating multiple visual vocabularies.
Contribution
The paper presents a new pipeline, hybrid data, and tuning strategy that significantly improves LVLMs' ability to understand multi-page documents at a fine-grained level without modifying model weights.
Findings
Fox outperforms existing models on fine-grained document tasks.
The approach enables focus anywhere in multi-page documents.
A new benchmark with 9 sub-tasks promotes further research.
Abstract
Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a novel task to boost the document understanding by making LVLMs focus attention on the document-level region, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the…
Code & Models
- 🤗 stepfun-ai/GOT-OCR2_0 (model) · 56k downloads · ♡ 1531
- 🤗 mallapraveen/GOT-OCR2_0 (model) · 4 downloads
- 🤗 srimanth-d/GOT_CPU (model) · 42 downloads · ♡ 11
- 🤗 RufusRubin777/GOT-OCR2_0_CPU (model) · 6 downloads
- 🤗 Maltokar/GOT_OCR_MP (model) · 2 downloads
- 🤗 aarishshahmohsin/got_ocr_2 (model) · 5 downloads
- 🤗 tdnathmlenthusiast/tester (model) · 2 downloads
- 🤗 uzumaki06/OCR2.0 (model) · 1 download
- 🤗 philipp-zettl/GOT-OCR2_0 (model)
- 🤗 justlurkinhere/GOT-OCR2_0 (model) · 4 downloads
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Digital Humanities and Scholarship
Methods: Focus
