Transferring General Multimodal Pretrained Models to Text Recognition

Junyang Lin; Xuancheng Ren; Yichang Zhang; Gao Liu; Peng Wang; An Yang; Chang Zhou

arXiv:2212.09297·cs.CV·December 20, 2022

TL;DR

OFA-OCR recasts text recognition as image captioning and directly transfers a unified vision-language pretrained model to the task, reaching state-of-the-art results on the Chinese text recognition benchmark without pretraining on large-scale annotated or synthetic text recognition data.

Contribution

The paper presents OFA-OCR, a method that transfers a unified vision-language pretrained model directly to text recognition, with no pretraining on large-scale annotated or synthetic text recognition data.

Findings

01. OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark.

02. An OCR pipeline built with OFA-OCR performs competitively with a product-level API.

03. The approach does not require pretraining on large-scale annotated or synthetic text recognition data.

Abstract

This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
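
To illustrate what "recasting text recognition as image captioning" could look like at the data level, here is a minimal sketch that wraps an OCR annotation (image plus transcription) as a captioning-style sequence-to-sequence sample. It is not the OFA-OCR implementation: the `CaptionSample` class, the helper functions, the file names, and the `PROMPT` string are all hypothetical, and the abstract does not state which instruction the authors actually use.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical instruction text; the abstract does not give the real prompt.
PROMPT = "What is the text in the image?"


@dataclass
class CaptionSample:
    image_path: str  # cropped text image (assumed input unit)
    source: str      # instruction given to the model alongside the image
    target: str      # string the decoder is trained to generate


def ocr_to_caption_sample(image_path: str, transcription: str,
                          prompt: str = PROMPT) -> CaptionSample:
    """Wrap one OCR annotation as a captioning-style seq2seq sample."""
    return CaptionSample(image_path=image_path,
                         source=prompt,
                         target=transcription.strip())


def build_dataset(annotations: List[Tuple[str, str]]) -> List[CaptionSample]:
    """Convert (image_path, transcription) pairs, skipping empty labels."""
    return [ocr_to_caption_sample(path, text)
            for path, text in annotations
            if text.strip()]


# Illustrative annotations; the second one has an empty label and is dropped.
samples = build_dataset([
    ("line_001.png", " 你好世界 "),
    ("line_002.png", ""),
])
```

Under this framing the transcription simply plays the role of the caption, so a captioning-capable pretrained model can in principle be finetuned on such samples without a recognition-specific output head.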

Code & Models

Repositories

ofa-sys/ofa (PyTorch, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Multimodal Machine Learning Applications