Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting

Hao Feng; Wei Shi; Ke Zhang; Xiang Fei; Lei Liao; Dingkang Yang; Yongkun Du; Xuecheng Wu; Jingqun Tang; Yang Liu; Hong Chen; Can Huang

arXiv:2602.05384·cs.CV·February 6, 2026

TL;DR

Dolphin-v2 is a scalable two-stage document parsing model that handles both digital-born and photographed documents. It adds enhanced layout analysis, fine-grained element detection, and semantic attribute extraction, and it outperforms previous systems.

Contribution

The paper introduces Dolphin-v2, a document parsing model that is more robust to distorted images, extracts richer information, and runs more efficiently than prior models.

Findings

01. +14.78 points on OmniDocBench
02. 91% error reduction on photographed documents
03. Efficient parallel inference

Abstract

Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy:…
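The first-stage output described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the names `DocType`, `Element`, `StageOneResult`, and `second_stage_inputs` are hypothetical, the field layout is not taken from the paper, and the second-stage strategy is left out because the abstract is truncated before describing it.

```python
from dataclasses import dataclass, field
from enum import Enum


class DocType(Enum):
    # The two document types named in the abstract.
    DIGITAL_BORN = "digital-born"
    PHOTOGRAPHED = "photographed"


@dataclass
class Element:
    # Hypothetical fields; the abstract does not specify an output schema.
    label: str          # e.g. "title", "paragraph", "table"
    reading_order: int  # position in the predicted reading sequence


@dataclass
class StageOneResult:
    # Stage one jointly yields a document type and a layout.
    doc_type: DocType
    elements: list[Element] = field(default_factory=list)


def second_stage_inputs(result: StageOneResult) -> list[Element]:
    """Return elements in predicted reading order for digital-born pages.

    The abstract only states that element-level detection with reading
    order applies to digital-born documents. How photographed pages are
    parsed is not visible in the truncated text, so this sketch returns
    an empty list for them instead of guessing.
    """
    if result.doc_type is DocType.DIGITAL_BORN:
        return sorted(result.elements, key=lambda e: e.reading_order)
    return []
```

For example, a digital-born result holding a table (order 2), a title (order 0), and a paragraph (order 1) would come back as title, paragraph, table.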

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications