UPOCR: Towards Unified Pixel-Level OCR Interface

Dezhi Peng; Zhenhua Yang; Jiaxin Zhang; Chongyu Liu; Yongxin Shi; Kai Ding; Fengjun Guo; Lianwen Jin

arXiv:2312.02694·cs.CV·December 6, 2023·1 cites

Open Access

TL;DR

UPOCR introduces a single vision Transformer-based model for pixel-level OCR tasks that performs well across text removal, text segmentation, and tampered text detection at the same time.

Contribution

The paper presents UPOCR, a novel unified OCR model that uses image-to-image transformation and task prompts to handle multiple pixel-level OCR tasks with a single architecture.

Findings

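01 Achieves state-of-the-art results on three pixel-level OCR tasks: text removal, text segmentation, and tampered text detection

02 Demonstrates that task prompts enable effective task-specific feature learning

03 Simplifies OCR model deployment by replacing task-specific models with one unified model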

Abstract

In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is…
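
The prompt mechanism described in the abstract can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the encoder and decoder are identity placeholders, the prompts are random rather than learned, and combining prompt and features by element-wise addition is an assumption, since the abstract only says that prompts push encoder features toward task-specific spaces. All function names here are hypothetical.

```python
import random

TASKS = ("text_removal", "text_segmentation", "tampered_text_detection")


def make_prompts(dim, seed=0):
    """One prompt vector per task; random here, learned in a real model."""
    rng = random.Random(seed)
    return {task: [rng.gauss(0.0, 0.02) for _ in range(dim)] for task in TASKS}


def encode(patches):
    """Placeholder for the shared encoder: returns the tokens unchanged."""
    return [list(token) for token in patches]


def apply_prompt(features, prompt):
    """Shift every token feature by the task prompt (assumed: element-wise add)."""
    return [[f + p for f, p in zip(token, prompt)] for token in features]


def decode(features):
    """Placeholder for the shared decoder: returns the tokens unchanged."""
    return [list(token) for token in features]


def run(patches, task, prompts):
    """Image-to-image pass: same encoder and decoder, task selected by prompt."""
    return decode(apply_prompt(encode(patches), prompts[task]))


patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # 2 tokens, feature dim 4
prompts = make_prompts(dim=4)
outputs = {task: run(patches, task, prompts) for task in TASKS}
```

The point of the sketch is that only `prompts[task]` changes between tasks; the encoder and decoder are shared, and every output keeps the shape of the input, which is what makes the interface image-to-image.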

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Dense Connections · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer