Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild

Yaozong Gan; Guang Li; Ren Togo; Keisuke Maeda; Takahiro Ogawa; Miki Haseyama

arXiv:2409.01534·cs.CV·July 24, 2025

Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild

Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

PDF

Open Access

TL;DR

This paper introduces Cross-domain Multi-step Thinking (CdMT), a novel framework leveraging large multimodal models for zero-shot, fine-grained traffic sign recognition across different countries and domains, without training data.

Contribution

The study proposes a new CdMT framework that uses multi-step reasoning with context, characteristic, and differential descriptions to improve zero-shot traffic sign recognition in the wild.

Findings

01

Achieved high recognition accuracies on multiple benchmark datasets.

02

Outperformed state-of-the-art methods across all tested datasets.

03

Demonstrated effectiveness in cross-country traffic sign recognition scenarios.

Abstract

In this study, we propose Cross-domain Multi-step Thinking (CdMT) to improve zero-shot fine-grained traffic sign recognition (TSR) performance in the wild. Zero-shot fine-grained TSR in the wild is challenging due to the cross-domain problem between clean template traffic signs and real-world counterparts, and existing approaches particularly struggle with cross-country TSR scenarios, where traffic signs typically differ between countries. The proposed CdMT framework tackles these challenges by leveraging the multi-step reasoning capabilities of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for LMMs. Context descriptions, which are enhanced by center coordinate prompt optimization, enable the precise localization of target traffic signs in complex road images and filter irrelevant responses via…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsHandwritten Text Recognition Techniques · Vehicle License Plate Recognition · Image Processing and 3D Reconstruction