SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection
Mohammed-En-Nadhir Zighem, Abdenour Hadid

TL;DR
SAViL-Det is a semantic-aware vision-language model that significantly improves multi-script text detection in natural scenes by integrating textual prompts with visual features through a novel decoder and contrastive learning.
Contribution
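The paper introduces a framework for multi-script text detection that combines a pre-trained CLIP model, an Asymptotic Feature Pyramid Network (AFPN) for multi-scale feature fusion, and a language-vision decoder trained with text-to-pixel contrastive learning.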
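The abstract describes the language-vision decoder as propagating semantic information from text prompts to visual features through cross-modal attention. The block below is a minimal sketch of that general idea, not the authors' decoder: it is a single attention head with no learned projections, and the function names `cross_modal_attention`, `softmax`, and `dot` are hypothetical. Pixel features act as queries, and prompt token features act as keys and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(visual, text):
    """Let each visual feature attend over the text token features.

    visual: list of N pixel feature vectors (queries), each of length d
    text:   list of T token feature vectors (keys and values), each of length d
    Returns N updated pixel vectors: the original pixel feature plus a
    softmax-weighted mix of text features (a residual connection).
    Assumption: learned query/key/value projections are omitted for brevity.
    """
    d = len(text[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for pixel in visual:
        # Scaled dot-product similarity between this pixel and every token.
        weights = softmax([dot(pixel, token) * scale for token in text])
        # Weighted sum of token features: the semantic context for this pixel.
        context = [sum(w * token[i] for w, token in zip(weights, text))
                   for i in range(d)]
        updated.append([p + c for p, c in zip(pixel, context)])
    return updated
```

Using visual features as queries keeps the output at the spatial resolution of the image, so a detection head could consume it directly. A real decoder would typically add learned projections, multiple heads, and normalization.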
Findings
Achieves state-of-the-art F-score of 84.8% on MLT-2019
Achieves state-of-the-art F-score of 90.2% on CTW1500
Effectively leverages semantic context for diverse script detection
Abstract
Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on…
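The text-to-pixel contrastive mechanism mentioned in the abstract can be illustrated with one common formulation: score each pixel feature against a single prompt embedding, then push scores up for text pixels and down for background pixels. The sketch below assumes a sigmoid over dot products with binary cross-entropy. The summary does not give the paper's exact loss, so this choice and the name `text_to_pixel_loss` are assumptions.

```python
import math

def sigmoid(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def text_to_pixel_loss(text_vec, pixel_vecs, is_text_pixel):
    """Mean binary cross-entropy between text-pixel similarity and a mask.

    text_vec:      prompt embedding, length d
    pixel_vecs:    list of N pixel feature vectors, each of length d
    is_text_pixel: list of N labels, 1 if the pixel lies in a text region
    Assumption: similarity is a plain dot product with no temperature.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for vec, label in zip(pixel_vecs, is_text_pixel):
        score = sigmoid(sum(t * v for t, v in zip(text_vec, vec)))
        if label:
            total += -math.log(score + eps)        # pull text pixels toward the prompt
        else:
            total += -math.log(1.0 - score + eps)  # push background pixels away
    return total / len(pixel_vecs)

# Hypothetical toy data: one pixel aligned with the prompt, one opposed to it.
prompt = [1.0, 0.0]
pixels = [[5.0, 0.0], [-5.0, 0.0]]
aligned_loss = text_to_pixel_loss(prompt, pixels, [1, 0])
misaligned_loss = text_to_pixel_loss(prompt, pixels, [0, 1])
```

On this toy data `aligned_loss` is much smaller than `misaligned_loss`, which is the behavior such an alignment objective is meant to reward.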
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling
