Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

Yue He; Chen Chen; Jing Zhang; Juhua Liu; Fengxiang He; Chaoyue Wang; Bo Du

arXiv:2112.12916·cs.CV·December 28, 2021·1 cites

TL;DR

This paper introduces a graph-based textual reasoning method for scene text recognition that leverages visual semantics and spatial context, setting a new state of the art on six STR benchmarks.

Contribution

It proposes a graph convolutional network for textual reasoning based on visual semantics, integrating it with existing models to enhance scene text recognition accuracy.

Findings

01 Sets new state-of-the-art on six STR benchmarks

02 Generalizes effectively to multi-linguistic datasets

03 Improves performance by incorporating visual semantics into reasoning

Abstract

Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model. This ignores the 2D spatial context of visual semantics within and between character instances, so these methods do not generalize well to arbitrary-shape scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR) by supervising…
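
The graph construction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_text_graph`, the Euclidean distance threshold `radius` standing in for "spatial similarity", and the choice of each root node as the pixel nearest the instance centroid are all hypothetical, since the abstract does not specify them.

```python
from itertools import combinations
from math import dist


def build_text_graph(instances, radius=1.5):
    """Build one graph from per-character pixel lists given in reading order.

    instances: list of lists of (row, col) pixel coordinates, one list per
               character instance (e.g. taken from a segmentation map).
    radius:    assumed stand-in for "spatial similarity"; two pixels of the
               same character are linked if they lie within this distance.

    Returns (nodes, edges, roots):
      nodes - list of (char_idx, row, col)
      edges - set of (i, j) node-index pairs with i < j
      roots - list with one node index per character
    """
    nodes, edges, roots = [], set(), []
    for char_idx, pixels in enumerate(instances):
        if not pixels:
            raise ValueError(f"character {char_idx} has no pixels")
        offset = len(nodes)
        nodes.extend((char_idx, r, c) for r, c in pixels)

        # Intra-character edges: link pixels that lie close together.
        for (i, p), (j, q) in combinations(enumerate(pixels), 2):
            if dist(p, q) <= radius:
                edges.add((offset + i, offset + j))

        # Assumed root choice: the pixel nearest the instance centroid.
        cy = sum(r for r, _ in pixels) / len(pixels)
        cx = sum(c for _, c in pixels) / len(pixels)
        root_local = min(
            range(len(pixels)), key=lambda k: dist(pixels[k], (cy, cx))
        )
        roots.append(offset + root_local)

    # Inter-character edges: chain consecutive roots in reading order,
    # which merges the per-character subgraphs into one graph.
    for a, b in zip(roots, roots[1:]):
        edges.add((min(a, b), max(a, b)))
    return nodes, edges, roots
```

Calling `build_text_graph([[(0, 0), (0, 1), (1, 0)], [(0, 5), (0, 6)]])` yields five pixel nodes, intra-character edges within each of the two instances, and a single edge joining the two roots. The pairwise loop is quadratic in the number of pixels per character, which is acceptable for a sketch but would likely need a neighbourhood lookup for full-resolution maps.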

Code & Models

Repositories

adeline-cs/GTR (PyTorch, official)

Videos

Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Human Pose and Action Recognition