Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision
Qiang Lu, Waikit Xiu, Xiying Li, Shenyu Hu, Shengbo Sun

TL;DR
This paper introduces a novel multi-modal framework combining open-vocabulary detection and contrastive learning to improve traffic sign recognition, especially for small, rare, and scale-variant signs in real-world scenarios.
Contribution
The paper proposes a two-stage framework that pairs a specialized detection model (NanoVerse YOLO) with a contrastive learning approach to improve recognition of long-tail and multi-scale traffic signs.
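The summary does not state which contrastive objective the paper uses. As a generic illustration of how text-image contrastive learning works, the sketch below computes a symmetric InfoNCE-style loss over paired image and text embeddings. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration, not details taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical helper: pair i of image_embs is assumed to match pair i of text_embs.

    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Similarity matrix: rows index images, columns index texts.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each image should pick out its own text.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each text should pick out its own image.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy example: two sign crops and two class descriptions in a 2-D embedding space.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the correctly paired embeddings give a loss near zero, while the swapped pairing gives a much larger loss, which is the signal that pulls matching image and text embeddings together during training.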
Findings
Achieves 78.4% mAP on TT100K for long-tail detection
Attains 91.8% accuracy and 88.9% recall in recognition tasks
Outperforms mainstream algorithms in complex scenarios
Abstract
Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, traffic sign datasets exhibit a pronounced long-tail distribution, resulting in a substantial decline in the recognition performance of traditional convolutional networks on low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv…
Taxonomy
Topics: Safety Warnings and Signage · Handwritten Text Recognition Techniques
