TL;DR
OmniParser V2 is a unified, task-agnostic framework for visually-situated text parsing (VsTP) built on Structured-Points-of-Thought (SPOT) prompting. It achieves state-of-the-art or competitive results and integrates with multimodal large language models.
Contribution
The paper proposes SPOT prompting schemas and a unified encoder-decoder architecture that simplify and generalize visually-situated text parsing across multiple tasks.
Findings
Achieves state-of-the-art or competitive results on four VsTP tasks: text spotting, key information extraction, table recognition, and layout analysis.
Effectively integrates with multimodal large language models.
Simplifies the processing pipeline by eliminating task-specific architectures.
Abstract
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that brings typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a single framework. Central to our approach are the proposed Structured-Points-of-Thought (SPOT) prompting schemas, which improve model performance across diverse scenarios by leveraging…
