Transferring General Multimodal Pretrained Models to Text Recognition
Junyang Lin, Xuancheng Ren, Yichang Zhang, Gao Liu, Peng Wang, An Yang, Chang Zhou

TL;DR
This paper introduces OFA-OCR, which recasts text recognition as image captioning and transfers a unified vision-language pretrained model to the task, achieving state-of-the-art results on the Chinese text recognition benchmark without pretraining on large-scale annotated or synthetic text recognition data.
Contribution
The paper presents OFA-OCR, a method that directly transfers a unified vision-language pretrained model to text recognition, removing the need to pretrain on large-scale annotated or synthetic text recognition data.
Findings
OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark.
An OCR pipeline built with OFA-OCR achieves performance competitive with a product-level API.
The approach does not require pretraining on large-scale annotated or synthetic text recognition data.
Abstract
This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
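The recasting described in the abstract can be illustrated with a minimal sketch: each labelled text image becomes a captioning-style sample whose target "caption" is the transcription. This is not the authors' implementation. The `CaptionSample` type, the `PROMPT` string, and both helper functions are hypothetical names introduced here for illustration, and the exact-match metric is only an assumed stand-in for whatever evaluation the paper uses.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical instruction text; the prompt actually used by OFA-OCR
# is not stated in this summary.
PROMPT = "what is the text in the image?"


@dataclass
class CaptionSample:
    """One captioning-style training example (illustrative structure)."""
    image_path: str   # path to the text image
    source_text: str  # instruction fed to the model alongside the image
    target_text: str  # sequence the model should generate


def ocr_to_caption_sample(image_path: str, transcription: str) -> CaptionSample:
    """Wrap an OCR label as a captioning sample: the transcription is the caption."""
    return CaptionSample(image_path, PROMPT, transcription.strip())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this framing, any sequence-to-sequence vision-language model that can be fine-tuned for captioning can in principle be fine-tuned on such samples without architectural changes, which is the appeal of the approach the abstract describes.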
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Multimodal Machine Learning Applications
