Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

Antonio Colombo, Giovanni Bianchi

arXiv:2605.18173·cs.CV·May 19, 2026

TL;DR

This paper introduces SAME-Net, a novel end-to-end scene text spotting framework that uses a soft attention mask embedding module to improve recognition accuracy without requiring text rectification.

Contribution

The paper proposes a new Soft Attention Mask Embedding module and integrates it into SAME-Net, enabling joint detection and recognition without auxiliary rectification modules.

Findings

01. Achieves 84.02% end-to-end H-mean on Total-Text, surpassing previous methods.

02. Attains 83.4% strong-lexicon accuracy on ICDAR 2015.

03. Effectively suppresses background noise and handles arbitrary-shaped text.
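
For readers unfamiliar with the metric in the first finding: H-mean is the harmonic mean of precision and recall (the same quantity as the F1 score). The short sketch below shows how it is computed. The function name `h_mean` and the input values are illustrative assumptions, not figures taken from the paper.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also known as F1 / F-measure)."""
    if precision + recall == 0:
        # Avoid division by zero when nothing was detected or matched.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall values, for illustration only;
# these are NOT the paper's reported numbers.
example = h_mean(0.80, 0.90)
print(f"H-mean: {example:.4f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a spotter cannot reach a high H-mean by trading away recall for precision or the reverse.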

Abstract

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary…
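
The idea of combining soft attention weights with a predicted mask can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head self-attention, the sigmoid projection, the element-wise product with the mask, and every name below (`soft_mask_embedding`, `w_q`, `w_k`, `w_v`, `w_out`) are assumptions chosen for illustration, and the sketch covers one feature level only, whereas the abstract describes a hierarchical embedding.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens.

    Every spatial position attends to every other one, which is what gives
    a Transformer encoder its global receptive field.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def soft_mask_embedding(features, pred_mask, w_q, w_k, w_v, w_out):
    """Hypothetical one-level sketch of soft-attention mask refinement.

    features:  (H, W, C) feature map
    pred_mask: (H, W) predicted text mask with values in [0, 1]
    Returns the refined mask (H, W) and the masked features (H, W, C).
    """
    h, w, c = features.shape
    tokens = features.reshape(h * w, c)                # flatten spatial grid
    encoded = self_attention(tokens, w_q, w_k, w_v)    # (H*W, C)
    logits = encoded @ w_out                           # (H*W,)
    attn = 1.0 / (1.0 + np.exp(-logits))               # soft weights in (0, 1)
    # Assumption: the soft weights modulate the predicted mask element-wise.
    refined = attn.reshape(h, w) * pred_mask
    return refined, features * refined[..., None]


# Random stand-in inputs, for shape checking only.
rng = np.random.default_rng(0)
H, W, C = 8, 16, 32
features = rng.standard_normal((H, W, C))
pred_mask = rng.uniform(0.0, 1.0, size=(H, W))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
w_out = rng.standard_normal(C) * 0.1

refined_mask, masked_features = soft_mask_embedding(
    features, pred_mask, w_q, w_k, w_v, w_out
)
print(refined_mask.shape, masked_features.shape)
```

In this sketch the refined mask can only lower the predicted mask values, so positions the attention branch scores weakly are damped before recognition. Whether SAME-Net uses this particular form of combination is not stated in the text above.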
