Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
Antonio Colombo, Giovanni Bianchi

TL;DR
This paper introduces SAME-Net, a novel end-to-end scene text spotting framework that uses a soft attention mask embedding module to improve recognition accuracy without requiring text rectification.
Contribution
The paper proposes a new Soft Attention Mask Embedding module and integrates it into SAME-Net, enabling joint detection and recognition without auxiliary rectification modules.
Findings
Achieves 84.02% end-to-end H-mean on Total-Text, surpassing previous methods.
Attains 83.4% strong-lexicon accuracy on ICDAR 2015.
Effectively suppresses background noise and handles arbitrary-shaped text.
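
For readers unfamiliar with the metric, H-mean in text detection and spotting benchmarks is the harmonic mean of precision and recall (the same quantity as the F-measure). The sketch below shows only that arithmetic; the function name `h_mean` and the input values are illustrative assumptions, not figures or code from the paper.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also known as F-measure)."""
    if precision + recall == 0:
        # Avoid division by zero when nothing was detected or matched.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall values chosen only for illustration;
# they are NOT results reported in the paper.
example = h_mean(0.8, 0.6)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a spotter cannot reach a high H-mean by trading recall away for precision or the reverse.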
Abstract
End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary…
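
The general idea of combining soft attention weights with a predicted mask can be illustrated with a very small sketch. This is not the paper's SAME module: the abstract describes a Transformer encoder and a hierarchical embedding, neither of which is reproduced here. The code only assumes that attention scores and mask probabilities share the same spatial grid and gates each mask value by a sigmoid weight; `soft_gate_mask`, `stable_sigmoid`, and the toy inputs are hypothetical names and values.

```python
import math
from typing import List

Grid = List[List[float]]


def stable_sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_gate_mask(mask: Grid, scores: Grid) -> Grid:
    """Multiply each mask probability by a soft attention weight in (0, 1).

    Assumption (not from the paper): `scores` are raw per-pixel attention
    logits on the same grid as `mask`.
    """
    if len(mask) != len(scores) or any(
        len(m_row) != len(s_row) for m_row, s_row in zip(mask, scores)
    ):
        raise ValueError("mask and scores must have the same shape")
    return [
        [m * stable_sigmoid(s) for m, s in zip(m_row, s_row)]
        for m_row, s_row in zip(mask, scores)
    ]


# Toy 2x2 inputs, purely illustrative.
mask = [[1.0, 1.0], [0.5, 0.0]]
scores = [[0.0, 10.0], [0.0, 5.0]]
refined = soft_gate_mask(mask, scores)
```

In this toy version, pixels with low attention scores are attenuated and pixels the mask already rejects stay at zero, which is the intuition behind using soft weights to suppress background responses.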
