OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition

Lixu Sun; Nurmemet Yolwas; Wushour Silamu

arXiv:2511.08133·cs.CV·November 12, 2025

TL;DR

This paper introduces OTSNet, a neurocognitive-inspired three-stage pipeline for scene text recognition that improves accuracy by addressing cross-modal misalignment and spatial distortions.

Contribution

The paper presents a novel three-stage architecture inspired by human visual perception, integrating attention, spatial-semantic modeling, and cross-modal verification for improved STR performance.

Findings

01. Achieves 83.5% accuracy on the Union14M-L benchmark

02. Attains 79.1% accuracy on the occluded OST dataset

03. Sets new records in 9 of 14 evaluation scenarios

Abstract

Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a…
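
The abstract does not spell out how DAME computes its differential attention maps. One plausible reading, which is an assumption and not confirmed by this page, is that two softmax attention maps are computed and one is subtracted from the other, so that weight both maps place on the same (for example, background) positions cancels out. The sketch below illustrates that general idea in plain Python; the function names, the `lam` weighting factor, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot-product scores (one row per query)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Hypothetical differential attention: A1 - lam * A2, applied to values.

    This is an illustrative assumption, not the paper's DAME definition.
    Weight that both maps assign to the same position is (partly) cancelled.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim = len(values[0])
    out = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in diff
    ]
    return diff, out


# Toy example: 2 query positions attending over 3 key/value positions.
q_a = [[1.0, 0.0], [0.0, 1.0]]
q_b = [[0.0, 1.0], [1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
diff_map, features = differential_attention(q_a, keys, q_b, keys, vals, lam=0.5)
```

In this sketch each row of the differential map sums to `1 - lam` rather than 1, and individual entries can be negative; a real encoder would typically follow the subtraction with normalization and learn `lam`, details the truncated abstract does not give.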

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications