DocMamba: Efficient Document Pre-training with State Space Model

Pengfei Hu; Zhenrong Zhang; Jiefeng Ma; Shuhang Liu; Jun Du; Jianshu Zhang

arXiv:2409.11887 · cs.CL · February 11, 2025

TL;DR

DocMamba is a new document understanding framework that uses state space models to achieve linear complexity, enabling efficient processing of long, visually-rich documents with state-of-the-art accuracy.

Contribution

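It introduces a state space model-based framework for document understanding, together with the Segment-First Bidirectional Scan (SFBS), which is designed to capture contiguous semantic information and improve both efficiency and effectiveness.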

Findings

01

Achieves state-of-the-art results on the FUNSD, CORD, and SROIE datasets.

02

Significantly improves speed and reduces memory usage.

03

Demonstrates potential for length extrapolation on HRDoc.

Abstract

In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the quadratic computational complexity of the self-attention mechanism hinders their efficiency and their ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and…
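
The linear-complexity claim in the abstract can be illustrated with a toy state space recurrence. The sketch below is not DocMamba's implementation: it assumes a diagonal state matrix with fixed (not input-dependent) parameters, scalar token features, and a simple sum to merge the two scan directions. The names `ssm_scan` and `bidirectional_scan` and the parameters `a`, `b`, `c` are hypothetical, chosen only for illustration. It shows that each token costs one constant-size state update, so the cost grows linearly with sequence length, whereas self-attention compares every pair of tokens.

```python
def ssm_scan(x, a, b, c):
    """Toy diagonal state space recurrence (illustrative assumption).

    h_t[n] = a[n] * h_{t-1}[n] + b[n] * x_t
    y_t    = sum_n c[n] * h_t[n]
    """
    h = [0.0] * len(a)  # hidden state, fixed size regardless of sequence length
    y = []
    for x_t in x:  # one constant-cost update per token: O(L) overall
        h = [a_n * h_n + b_n * x_t for a_n, h_n, b_n in zip(a, h, b)]
        y.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return y


def bidirectional_scan(x, a, b, c):
    """Scan the sequence in both directions and sum the outputs.

    Summation is an assumed merge rule; it lets every position see
    context from both earlier and later tokens at linear cost.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    tokens = [1.0, 0.0, 0.0]
    print(ssm_scan(tokens, a=[0.5], b=[1.0], c=[1.0]))
    print(bidirectional_scan(tokens, a=[0.5], b=[1.0], c=[1.0]))
```

In this toy setting the order in which tokens are fed to the scan determines which tokens are neighbors in the recurrence, which is why a scan order that keeps the tokens of one segment contiguous could matter for documents.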

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings