Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision

Qiang Lu; Waikit Xiu; Xiying Li; Shenyu Hu; Shengbo Sun

arXiv:2507.23331·cs.CV·August 1, 2025

Open Access

TL;DR

This paper introduces a novel multi-modal framework combining open-vocabulary detection and contrastive learning to improve traffic sign recognition, especially for small, rare, and scale-variant signs in real-world scenarios.

Contribution

It proposes a two-stage framework with a specialized detection model and a contrastive learning approach that enhances recognition of long-tail and multi-scale traffic signs.

Findings

01. Achieves 78.4% mAP on TT100K for long-tail detection

02. Attains 91.8% accuracy and 88.9% recall in recognition tasks

03. Outperforms mainstream algorithms in complex scenarios

Abstract

Traffic sign recognition, a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges. First, traffic sign datasets exhibit a pronounced long-tail distribution, so the recognition performance of traditional convolutional networks declines substantially on low-frequency and out-of-distribution classes. Second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, which makes multi-scale feature extraction difficult. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv…
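The cross-modal learning mentioned in the abstract pairs visual features with text. As an illustration only, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss over matched image and text embeddings. It is not taken from the paper: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for the example, and the authors' actual loss may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    image_embs[i] and text_embs[i] are assumed to describe the same sign class.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to score each text against all images.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy embeddings (assumed values): matched pairs versus swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each image embedding lines up with its own text embedding and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.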

Taxonomy

Topics: Safety Warnings and Signage · Handwritten Text Recognition Techniques