OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

Wenwen Yu; Zhibo Yang; Jianqiang Wan; Sibo Song; Jun Tang; Wenqing Cheng; Yuliang Liu; Xiang Bai

arXiv:2502.16161·cs.CV·April 22, 2026

TL;DR

OmniParser V2 introduces a unified, task-agnostic framework for visual text parsing tasks using Structured-Points-of-Thought prompting, achieving state-of-the-art results and demonstrating versatility with multimodal large language models.

Contribution

The paper proposes Structured-Points-of-Thought (SPOT) prompting schemas and a unified encoder-decoder architecture that simplify visual text parsing and generalize it across multiple tasks.

Findings

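01. Achieves state-of-the-art or competitive results on four visually-situated text parsing (VsTP) tasks: text spotting, key information extraction, table recognition, and layout analysis.

02. Integrates effectively with multimodal large language models.

03. Simplifies the processing pipeline by eliminating task-specific architectures.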

Abstract

Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a single framework. Central to our approach are the proposed Structured-Points-of-Thought (SPOT) prompting schemas, which improve model performance across diverse scenarios by leveraging…

Code & Models

Repositories

AlibabaResearch/AdvancedLiterateMachinery (GitHub)
