Scene Text Recognition with Semantics
Joshua Cesare Placidi, Yishu Miao, Zixu Wang, Lucia Specia

TL;DR
This paper introduces a multimodal scene text recognition approach that incorporates semantic scene context via object tags into a transformer model, improving accuracy especially on noisy or obscured text images.
Contribution
It presents a novel method that fuses semantic scene information with visual data in a transformer architecture for enhanced scene text recognition.
Findings
Higher performance on noisy text images
Effective integration of semantic scene context
Outperforms traditional models on benchmark datasets
Abstract
Scene Text Recognition (STR) models have achieved high performance in recent years on benchmark datasets where text images are presented with minimal noise. Traditional STR pipelines take a cropped image as sole input and attempt to identify the characters present. This infrastructure can fail in instances where the input image is noisy or the text is partially obscured. This paper proposes using semantic information from the greater scene to contextualise predictions. We generate semantic vectors using object tags and fuse this information into a transformer-based architecture. The results demonstrate that our multimodal approach yields higher performance than traditional benchmark models, particularly on noisy instances.
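The abstract does not say how the semantic vectors are built or where they enter the transformer, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that object tags are embedded, mean-pooled into a single scene vector, and prepended as an extra token to the visual feature sequence. The names `tag_embedding`, `semantic_vector`, `fuse`, and `EMBED_DIM` are hypothetical, and the hash-seeded embedding is a stand-in for whatever learned or pretrained embedding a real system would use.

```python
import hashlib
import random
from typing import List

EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration


def tag_embedding(tag: str, dim: int = EMBED_DIM) -> List[float]:
    """Deterministic stand-in for a learned or pretrained tag embedding."""
    digest = hashlib.sha256(tag.lower().encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16) % (2**32))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def semantic_vector(tags: List[str], dim: int = EMBED_DIM) -> List[float]:
    """Mean-pool tag embeddings into one scene-level vector (zeros if no tags)."""
    if not tags:
        return [0.0] * dim
    vectors = [tag_embedding(t, dim) for t in tags]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fuse(visual_tokens: List[List[float]], scene_vec: List[float]) -> List[List[float]]:
    """Prepend the scene vector as an extra token to the visual token sequence.

    This is an assumed fusion scheme; the resulting sequence would then be
    passed to a transformer encoder in place of the visual tokens alone.
    """
    if any(len(token) != len(scene_vec) for token in visual_tokens):
        raise ValueError("visual tokens and scene vector must share a dimension")
    return [list(scene_vec)] + [list(token) for token in visual_tokens]


# Four placeholder visual tokens, as if produced by an image backbone.
visual_tokens = [[0.1] * EMBED_DIM for _ in range(4)]
# Object tags as they might come from an off-the-shelf object detector.
scene_vec = semantic_vector(["bottle", "table", "person"])
fused = fuse(visual_tokens, scene_vec)
print(len(fused), len(fused[0]))
```

Prepending a token keeps the visual sequence untouched and lets self-attention decide how much weight the scene context receives. Other fusion choices, such as concatenating the scene vector to every visual token or using cross-attention, would be equally consistent with the abstract.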
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Text and Document Classification Technologies
