Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Haoyu Cao, Changcun Bao, Chaohu Liu, Huang Chen, Kun Yin, Hao Liu, Yinsong Liu, Deqiang Jiang, Xing Sun

TL;DR
SeRum is a new end-to-end document understanding model that focuses on selective regions of interest, improving accuracy and speed over existing multi-stage approaches by using a content-aware token merge mechanism.
Contribution
Introduces SeRum, a novel model that converts document understanding into a local decoding process with content-aware token merging, reducing complexity and enhancing performance.
Findings
Achieves state-of-the-art results on document understanding tasks.
Provides competitive performance on text spotting tasks.
Speeds up decoding compared to multi-stage methods.
Abstract
We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, with applications including document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism lets the model pay more attention to the regions of interest generated by the query decoder, which improves the model's effectiveness and accelerates decoding in the generative scheme. We also design several pre-training tasks to enhance the understanding and local awareness of the model. Experimental results demonstrate that SeRum achieves…
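To make the idea of content-aware token merging concrete, here is a minimal sketch in plain Python. It is not SeRum's actual module: it assumes each visual token already has a relevance score (for example, one derived from query-decoder attention), keeps the highest-scoring tokens, and averages the rest into a single background token. The function name `merge_tokens` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def merge_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the `keep` highest-scoring tokens and average the rest.

    Illustrative simplification (an assumption, not the paper's method):
    all low-scoring tokens are collapsed into one mean "background" token.
    Returns the shortened token list and the indices of the kept tokens.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices from most to least relevant.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original spatial order
    dropped_idx = order[keep:]

    merged = [list(tokens[i]) for i in kept_idx]
    if dropped_idx:
        dim = len(tokens[0])
        # Mean of the dropped tokens, computed per feature dimension.
        background = [
            sum(tokens[i][d] for i in dropped_idx) / len(dropped_idx)
            for d in range(dim)
        ]
        merged.append(background)
    return merged, kept_idx
```

With four tokens and `keep=2`, the decoder would attend over three tokens instead of four. The shorter sequence is where a speedup of this kind would come from, while the background token retains a coarse summary of the discarded regions.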
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
