TL;DR
This paper introduces a scalable, linear-time State-Space Model (Mamba) for OCR of historical newspapers, demonstrating competitive accuracy and improved efficiency over Transformer-based models.
Contribution
It presents what the authors describe as, to their knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with Mamba sequence modeling, and benchmarks it against Transformer- and BiLSTM-based recognizers.
Findings
Mamba models halve inference time compared to Transformer-based models.
All neural models achieve a character error rate (CER) of around 2% on historical newspaper OCR.
Mamba maintains competitive accuracy with superior memory efficiency.
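For readers unfamiliar with the metric behind the 2% figure, CER is conventionally the character-level edit distance between the reference and the predicted transcription, divided by the reference length. The following is a minimal sketch of that standard definition; the function names are illustrative, and the paper's own evaluation script (including any text normalization it applies) is not shown in the source.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a 2% CER corresponds to roughly two character errors per hundred reference characters.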
Abstract
End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present, to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN,…
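To make the "linear-time" claim concrete, the sketch below runs a toy state-space recurrence over a sequence in a single pass, so the cost grows linearly with sequence length rather than quadratically as in self-attention. This is an illustration only, not the paper's model: it assumes a scalar state with fixed parameters `a`, `b`, and `c`, whereas Mamba makes its parameters input-dependent and uses a hardware-aware scan. The names `ssm_scan` and `bidirectional_scan` are hypothetical, and summing the two directions is just one assumed way of combining them.

```python
from typing import List


def ssm_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Toy linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    One pass over the input, so the work is O(len(x)).
    """
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t  # update the hidden state
        ys.append(c * h)     # read out the output for this step
    return ys


def bidirectional_scan(x: List[float], **params: float) -> List[float]:
    """Combine a left-to-right and a right-to-left scan by summation (an assumption)."""
    fwd = ssm_scan(x, **params)
    bwd = ssm_scan(x[::-1], **params)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]
```

In an OCR setting, `x` would stand in for one channel of the feature sequence produced by the CNN visual encoder; real models operate on vectors per time step rather than scalars.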
