Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition
Yue He, Chen Chen, Jing Zhang, Juhua Liu, Fengxiang He, Chaoyue Wang, Bo Du

TL;DR
This paper introduces a novel graph-based textual reasoning method for scene text recognition that leverages visual semantics and spatial context, significantly improving performance across multiple benchmarks.
Contribution
It proposes a graph convolutional network for textual reasoning based on visual semantics, integrating it with existing models to enhance scene text recognition accuracy.
Findings
Sets new state-of-the-art on six STR benchmarks
Generalizes effectively to multi-linguistic datasets
Improves performance by incorporating visual semantics into reasoning
Abstract
Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model. This ignores the 2D spatial context of visual semantics within and between character instances, so these methods do not generalize well to arbitrary-shape scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR) by supervising…
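The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or how root nodes are chosen, so the Gaussian kernel on pixel distance, the `sigma` and `threshold` parameters, the centroid-based root choice, and the function names `build_instance_subgraph` and `merge_subgraphs` are all assumptions made for illustration.

```python
import numpy as np


def build_instance_subgraph(pixels, sigma=2.0, threshold=0.5):
    """Build one subgraph for a single character instance.

    pixels: (n, 2) array of (row, col) coordinates belonging to the instance.
    Returns (adjacency, root_index).
    Assumption: "spatial similarity" is a Gaussian kernel on Euclidean
    distance, and an edge exists when the similarity reaches `threshold`.
    """
    pixels = np.asarray(pixels, dtype=float)
    diff = pixels[:, None, :] - pixels[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    sim = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops in this sketch
    # Assumption: the root node is the pixel closest to the instance centroid.
    centroid = pixels.mean(axis=0)
    root = int(np.argmin(np.linalg.norm(pixels - centroid, axis=1)))
    return adj, root


def merge_subgraphs(instances, sigma=2.0, threshold=0.5):
    """Merge per-instance subgraphs into one graph.

    instances: list of (n_i, 2) pixel arrays, given in reading order.
    Consecutive instances are linked through their root nodes.
    Returns (adjacency, root_indices) with indices in the merged graph.
    """
    total = sum(len(p) for p in instances)
    adj = np.zeros((total, total))
    roots = []
    offset = 0
    for p in instances:
        sub, root = build_instance_subgraph(p, sigma, threshold)
        n = len(p)
        adj[offset:offset + n, offset:offset + n] = sub
        roots.append(offset + root)
        offset += n
    # Sequentially connect neighbouring subgraphs by their root nodes.
    for a, b in zip(roots[:-1], roots[1:]):
        adj[a, b] = adj[b, a] = 1.0
    return adj, roots


# Two toy "characters", each a short horizontal run of pixels.
first = [(0, 0), (0, 1), (0, 2)]
second = [(0, 10), (0, 11), (0, 12)]
adjacency, root_nodes = merge_subgraphs([first, second])
```

In this toy case the only edge between the two instances is the one joining their root nodes. The merged adjacency matrix is the kind of input a graph convolutional layer would consume.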
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Human Pose and Action Recognition
