Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, and Can Huang

arXiv:2505.14059·cs.CV·May 21, 2025

TL;DR

Dolphin is a multimodal document image parsing model that efficiently generates structured layout elements and content by leveraging heterogeneous anchors and prompts, achieving state-of-the-art results on diverse benchmarks.

Contribution

The paper introduces Dolphin, an analyze-then-parse model for document image parsing that uses heterogeneous anchor prompting and is trained on a large-scale dataset.

Findings

01. Achieves state-of-the-art performance on multiple benchmarks.
02. Ensures high efficiency with a lightweight architecture.
03. Effectively handles complex, intertwined document elements.

Abstract

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we…
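
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the Dolphin implementation or its API: `Anchor`, `PROMPTS`, `analyze_layout`, `parse_element`, `parse_page`, and the toy page format are all hypothetical stand-ins, and the model calls are replaced by dictionary lookups. It only shows the control flow: stage 1 yields layout elements in reading order, and stage 2 parses each element independently with a prompt chosen by element type.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple


class Anchor(NamedTuple):
    """One layout element found in stage 1 (hypothetical structure)."""
    kind: str    # e.g. "paragraph", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1); a placeholder region


# Hypothetical task-specific prompts, one per element type.
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Transcribe the formula in this region.",
}


def analyze_layout(page):
    """Stage 1 stand-in: return anchors in reading order.

    A real model would predict these from the page image; the toy page
    used here already carries them.
    """
    return [Anchor(kind, bbox) for kind, bbox in page["layout"]]


def parse_element(page, anchor):
    """Stage 2 stand-in: parse one anchor with its type-specific prompt."""
    prompt = PROMPTS[anchor.kind]
    content = page["content"][anchor.bbox]  # stub for a model call
    return {"kind": anchor.kind, "prompt": prompt, "content": content}


def parse_page(page, max_workers=4):
    """Analyze the layout, then parse all elements in parallel."""
    anchors = analyze_layout(page)
    # Once anchored, elements are independent of each other, so they can
    # be parsed concurrently. Executor.map returns results in input
    # order, which preserves the reading order from stage 1.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: parse_element(page, a), anchors))


# Toy page: layout and content are given directly instead of predicted.
toy_page = {
    "layout": [("paragraph", (0, 0, 100, 20)), ("table", (0, 30, 100, 80))],
    "content": {
        (0, 0, 100, 20): "Document image parsing is challenging.",
        (0, 30, 100, 80): "| a | b |",
    },
}

result = parse_page(toy_page)
```

Because each element is parsed on its own, the second stage avoids generating the whole page as one long autoregressive sequence, which is the efficiency argument the abstract makes.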

Code & Models

Repositories

bytedance/dolphin (PyTorch, official)

Videos

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications