Levenshtein OCR
Cheng Da, Peng Wang, Cong Yao

TL;DR
This paper introduces LevOCR, a scene text recognition method based on a Vision-Language Transformer (VLT) that iteratively refines an initial prediction through character-level deletion and insertion operations, and reports state-of-the-art results on standard benchmarks.
Contribution
LevOCR presents an innovative iterative sequence refinement approach for scene text recognition using a cross-modal transformer and imitation learning.
Findings
Achieves state-of-the-art performance on standard benchmarks.
Demonstrates effective parallel decoding and dynamic length change.
Validates the interpretability and robustness of the proposed method.
Abstract
A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the proposed method (named Levenshtein OCR, and LevOCR for short) explores an alternative way for automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations: deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. The quantitative experiments clearly…
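The refinement idea in the abstract (delete wrong characters, then insert missing ones) can be illustrated with a small sketch. This is not the LevOCR model: in the paper the operations are predicted by a learned cross-modal transformer, whereas the hypothetical functions below (`lcs_alignment`, `refine_once`) compute an oracle edit from a known ground truth, which is roughly the kind of expert behaviour an imitation-learning setup might try to reproduce. The alignment uses a longest common subsequence, an assumption made here for simplicity.

```python
from typing import List, Tuple


def lcs_alignment(pred: str, target: str) -> Tuple[List[bool], List[bool]]:
    """Align pred and target along one longest common subsequence (LCS).

    Returns (keep, matched): keep[i] is True if pred[i] is kept,
    matched[j] is True if target[j] is already covered by a kept character.
    """
    n, m = len(pred), len(target)
    # dp[i][j] = LCS length of the suffixes pred[i:] and target[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if pred[i] == target[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    keep = [False] * n
    matched = [False] * m
    i = j = 0
    while i < n and j < m:
        if pred[i] == target[j]:
            keep[i] = matched[j] = True
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # pred[i] is not part of the alignment: delete it
        else:
            j += 1  # target[j] is missing from pred: insert it later
    return keep, matched


def refine_once(pred: str, target: str):
    """One oracle refinement step: deletion first, then insertion."""
    keep, matched = lcs_alignment(pred, target)

    # Deletion: drop every predicted character outside the alignment.
    deleted = [(i, c) for i, (c, k) in enumerate(zip(pred, keep)) if not k]
    after_delete = "".join(c for c, k in zip(pred, keep) if k)

    # Insertion: add every target character that was not matched.
    inserted = [(j, c) for j, (c, ok) in enumerate(zip(target, matched)) if not ok]
    kept_chars = iter(after_delete)
    refined = "".join(next(kept_chars) if ok else c for c, ok in zip(target, matched))
    return deleted, after_delete, inserted, refined


if __name__ == "__main__":
    # Hypothetical noisy prediction for the word "hello".
    print(refine_once("hcllo0", "hello"))
```

Because each deletion and insertion decision is tied to a character position, all decisions in a step can be made at once and the sequence length can shrink or grow between steps, which matches the parallel decoding and dynamic length change that the abstract describes.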
Taxonomy
Topics: Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Adam · Softmax · Residual Connection · Position-Wise Feed-Forward Layer · Dropout
