SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection

Mohammed-En-Nadhir Zighem; Abdenour Hadid

arXiv:2507.20188 · cs.CV · July 29, 2025

Open Access

TL;DR

SAViL-Det is a semantic-aware vision-language model that significantly improves multi-script text detection in natural scenes by integrating textual prompts with visual features through a novel decoder and contrastive learning.

Contribution

The paper introduces a framework that combines a pre-trained CLIP model, an Asymptotic Feature Pyramid Network (AFPN), and a language-vision decoder trained with text-to-pixel contrastive learning to improve multi-script text detection.

Findings

01. Achieves a state-of-the-art F-score of 84.8% on MLT-2019

02. Achieves a state-of-the-art F-score of 90.2% on CTW1500

03. Effectively leverages semantic context for diverse script detection

Abstract

Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on…
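The two mechanisms named in the abstract, cross-modal attention from text prompts to visual features and text-to-pixel contrastive alignment, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, the residual connection, and the sigmoid-over-dot-product loss are all assumptions chosen for illustration, since the excerpt does not give the exact formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(visual, text):
    """Hypothetical single-head attention: pixels attend to prompt tokens.

    visual: (N, D) flattened pixel features, used as queries.
    text:   (T, D) prompt token features, used as keys and values.
    Returns (N, D) visual features enriched with text information.
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (N, T) scaled similarities
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return visual + weights @ text          # residual connection (assumed)


def text_to_pixel_contrastive_loss(pixels, text_vec, mask):
    """Assumed formulation: binary cross-entropy on pixel-text dot products.

    pixels:   (N, D) pixel features.
    text_vec: (D,)   sentence-level text feature.
    mask:     (N,)   1.0 for text pixels, 0.0 for background.
    Text pixels are pulled toward the text feature, background pushed away.
    """
    logits = pixels @ text_vec              # (N,) similarity per pixel
    # Numerically stable BCE-with-logits.
    loss = np.maximum(logits, 0) - logits * mask + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()
```

In this sketch the loss is low when pixels inside text regions have a large positive dot product with the text feature and background pixels have a large negative one, which is the alignment behaviour the abstract describes.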

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling