InstructOCR: Instruction Boosting Scene Text Spotting
Chen Duan, Qianyi Jiang, Pei Fu, Jiamin Chen, Shengxi Li, Zining Wang, Shan Guo, Junfeng Luo

TL;DR
InstructOCR introduces an instruction-based approach to scene text spotting that leverages human language instructions, improving accuracy and flexibility, and achieving state-of-the-art results on key benchmarks.
Contribution
The paper presents a novel instruction-driven framework for scene text spotting that enhances understanding and performance by integrating human language instructions during training and inference.
Findings
Achieves state-of-the-art results on standard benchmarks.
Improves downstream VQA task performance by over 2%.
Demonstrates the effectiveness of instruction strategies in OCR.
Abstract
In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques
