Enhancing Energy Minimization Framework for Scene Text Recognition with Top-Down Cues
Anand Mishra, Karteek Alahari, C. V. Jawahar

TL;DR
This paper enhances scene text recognition by integrating bottom-up character detections with top-down language cues within an energy minimization framework, achieving improved accuracy on multiple benchmarks.
Contribution
It introduces a novel energy minimization model that combines character detection scores with lexicon-based language priors for scene text recognition.
Findings
Outperforms comparable methods on multiple datasets
Integrating CNN features further improves accuracy
Rigorous analysis validates each step of the approach
Abstract
Recognizing scene text is a challenging problem, even more so than the recognition of scanned documents. This problem has gained significant attention from the computer vision community in recent years, and several methods based on energy minimization frameworks and deep learning approaches have been proposed. In this work, we focus on the energy minimization framework and propose a model that exploits both bottom-up and top-down cues for recognizing cropped words extracted from street images. The bottom-up cues are derived from individual character detections from an image. We build a conditional random field model on these detections to jointly model the strength of the detections and the interactions between them. These interactions are top-down cues obtained from a lexicon-based prior, i.e., language statistics. The optimal word represented by the text image is obtained by…
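To make the idea of combining bottom-up and top-down cues concrete, here is a minimal sketch, not the authors' implementation. It assumes a unary cost of 1 minus the detection confidence, a pairwise cost of a weight times 1 minus a lexicon bigram probability, and one candidate set per character position. Because the abstract is cut off before naming the inference method, the sketch swaps in exact dynamic programming (Viterbi) on a simple chain. The names `bigram_prior`, `min_energy_word`, and `pair_weight`, along with the toy lexicon and scores, are hypothetical.

```python
# Toy sketch: word recognition as energy minimization on a chain.
# Cost definitions, names, and data are illustrative assumptions,
# not the model or inference procedure from the paper.

def bigram_prior(lexicon):
    """Estimate bigram probabilities from a list of lexicon words."""
    counts = {}
    for word in lexicon:
        for a, b in zip(word, word[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def min_energy_word(detections, prior, alphabet, pair_weight=1.0):
    """Return (word, energy) minimizing unary + pairwise costs.

    detections: one dict per character position, mapping a character
                to a detector confidence in [0, 1] (bottom-up cue).
    prior:      dict mapping (char, char) to a bigram probability
                (top-down cue).
    """
    if not detections:
        raise ValueError("need at least one character position")

    def unary(i, c):
        # Assumed unary cost: low when the detector is confident.
        return 1.0 - detections[i].get(c, 0.0)

    def pairwise(a, b):
        # Assumed pairwise cost: low when the lexicon favors the bigram.
        return pair_weight * (1.0 - prior.get((a, b), 0.0))

    # Viterbi forward pass: cost[c] is the best energy of a prefix ending in c.
    cost = {c: unary(0, c) for c in alphabet}
    back = []
    for i in range(1, len(detections)):
        new_cost, pointer = {}, {}
        for c in alphabet:
            best_prev = min(alphabet, key=lambda a: cost[a] + pairwise(a, c))
            new_cost[c] = cost[best_prev] + pairwise(best_prev, c) + unary(i, c)
            pointer[c] = best_prev
        back.append(pointer)
        cost = new_cost

    # Backtrack from the cheapest final character.
    last = min(alphabet, key=lambda c: cost[c])
    chars = [last]
    for pointer in reversed(back):
        chars.append(pointer[chars[-1]])
    return "".join(reversed(chars)), cost[last]


# Toy data: the detector slightly prefers "b" over "p" at position 2.
prior = bigram_prior(["open", "pen", "pet"])
alphabet = "obpent"
detections = [
    {"o": 0.9},
    {"b": 0.5, "p": 0.45},
    {"e": 0.8},
    {"n": 0.7},
]
word, energy = min_energy_word(detections, prior, alphabet)
```

With these toy inputs, the detection scores alone (`pair_weight=0.0`) give "oben", while adding the lexicon prior gives "open". That is the effect the abstract attributes to top-down cues.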
