Levenshtein OCR

Cheng Da; Peng Wang; Cong Yao

arXiv:2209.03594·cs.CV·November 15, 2022

TL;DR

This paper introduces LevOCR, a scene text recognition method based on a Vision-Language Transformer. It iteratively refines an initial prediction through character-level deletion and insertion operations and reports state-of-the-art results on standard benchmarks.

Contribution

LevOCR casts scene text recognition as iterative sequence refinement, using a cross-modal transformer whose deletion and insertion operations are learned with imitation learning.

Findings

01 Achieves state-of-the-art performance on standard benchmarks.

02 Demonstrates effective parallel decoding and dynamic length change.

03 Validates the interpretability and robustness of the proposed method.

Abstract

A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the proposed method (named Levenshtein OCR, and LevOCR for short) explores an alternative way for automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations: deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. The quantitative experiments clearly…
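
The deletion and insertion operations described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`oracle_deletions`, `oracle_insertions`, `refine`) are hypothetical, and the edits are computed from the ground truth with Python's `difflib`, standing in for the kind of expert signal that imitation learning might train against. In LevOCR the decisions are predicted by the cross-modal transformer from visual and textual features, not read off the target.

```python
from difflib import SequenceMatcher
from typing import List


def oracle_deletions(pred: str, target: str) -> List[bool]:
    """One flag per character of pred; True means 'delete this character'.

    Assumption: characters outside difflib's matching blocks are the ones
    an expert would delete. A learned model would predict these flags.
    """
    delete = [True] * len(pred)
    matcher = SequenceMatcher(None, pred, target, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            delete[i] = False
    return delete


def apply_deletions(pred: str, delete: List[bool]) -> str:
    """Drop every flagged character; all positions are handled at once."""
    return "".join(ch for ch, d in zip(pred, delete) if not d)


def oracle_insertions(kept: str, target: str) -> List[str]:
    """Text to insert into each of the len(kept) + 1 gaps around kept.

    Relies on kept being a subsequence of target, which holds after
    apply_deletions with the oracle flags above.
    """
    gaps = [""] * (len(kept) + 1)
    k = 0  # number of kept characters matched so far
    for ch in target:
        if k < len(kept) and kept[k] == ch:
            k += 1
        else:
            gaps[k] += ch
    return gaps


def apply_insertions(kept: str, gaps: List[str]) -> str:
    """Interleave gap text with the kept characters."""
    out = [gaps[0]]
    for ch, gap in zip(kept, gaps[1:]):
        out.append(ch)
        out.append(gap)
    return "".join(out)


def refine(pred: str, target: str) -> str:
    """One deletion pass followed by one insertion pass."""
    kept = apply_deletions(pred, oracle_deletions(pred, target))
    return apply_insertions(kept, oracle_insertions(kept, target))


if __name__ == "__main__":
    print(refine("hcuse", "house"))
```

With a perfect oracle, a single deletion pass followed by a single insertion pass recovers the target, and the output length is free to differ from the input length. A learned model makes imperfect decisions, which is presumably why the abstract describes the refinement as iterative.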

Taxonomy

Topics: Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Adam · Softmax · Residual Connection · Position-Wise Feed-Forward Layer · Dropout