Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

Haoyu Cao; Changcun Bao; Chaohu Liu; Huang Chen; Kun Yin; Hao Liu; Yinsong Liu; Deqiang Jiang; Xing Sun

arXiv:2309.01131 · cs.CV · September 6, 2023 · 1 citation

TL;DR

SeRum is a new end-to-end document understanding model that focuses on selective regions of interest, improving accuracy and speed over existing multi-stage approaches by using a content-aware token merge mechanism.

Contribution

Introduces SeRum, a novel model that converts document understanding into a local decoding process with content-aware token merging, reducing complexity and enhancing performance.

Findings

01. Achieves state-of-the-art results on document understanding tasks.

02. Provides competitive performance on text spotting tasks.

03. Speeds up decoding compared to multi-stage methods.

Abstract

We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, with applications including document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism lets the model pay more attention to the regions of interest generated by the query decoder, which improves the model's effectiveness and speeds up decoding in the generative scheme. We also designed several pre-training tasks to enhance the understanding and local awareness of the model. Experimental results demonstrate that SeRum achieves…
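The abstract does not spell out how the content-aware token merge works, so the following is only a rough sketch of the general idea under stated assumptions: each visual token carries a relevance score (standing in for attention from the query decoder), the top-scoring tokens are kept as they are, and the remaining tokens are folded into a single score-weighted average. The function name `merge_tokens`, the top-k rule, and the averaging step are all hypothetical choices made for illustration and should not be read as the paper's implementation.

```python
from typing import List, Sequence, Tuple


def merge_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the `keep` highest-scoring tokens and merge the rest into one.

    Illustrative assumption: `scores` plays the role of per-token relevance
    from a query decoder. Returns the shortened token list and the original
    indices of the tokens that were kept unmerged.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by score, highest first (ties keep original order).
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original (spatial) order
    rest_idx = order[keep:]

    out = [list(tokens[i]) for i in kept_idx]

    if rest_idx:
        total = sum(scores[i] for i in rest_idx)
        if total > 0:
            weights = [scores[i] / total for i in rest_idx]
        else:
            # Fall back to a plain mean when all remaining scores are zero.
            weights = [1.0 / len(rest_idx)] * len(rest_idx)
        dim = len(tokens[0])
        merged = [
            sum(w * tokens[i][d] for w, i in zip(weights, rest_idx))
            for d in range(dim)
        ]
        out.append(merged)  # one summary token for the low-scoring regions

    return out, kept_idx


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    rel = [0.125, 0.5, 0.125, 0.25]
    shortened, kept = merge_tokens(toks, rel, keep=2)
    print(kept, shortened)
```

In this toy setup a decoder would attend over three tokens instead of four. Shortening the sequence in this way is the kind of saving the abstract attributes to local decoding, although the real module may select and merge tokens differently.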

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings