Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Wei-Chia Chang, Yan-Ann Chen

TL;DR
This paper introduces a zero-shot vehicle recognition system that combines vision-language models with retrieval-augmented generation, enabling accurate identification of new vehicle models without retraining.
Contribution
It proposes a pipeline that integrates VLMs with RAG for zero-shot vehicle recognition, avoiding large-scale retraining and allowing quick updates through added textual descriptions.
Findings
Achieved a nearly 20% improvement over the CLIP baseline.
Demonstrated effective zero-shot recognition of new vehicle models.
Enabled rapid updates by adding textual descriptions of vehicles.
Abstract
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision-language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles.…
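The retrieval and prompt-assembly steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the database entries, the function names (tokenize, cosine, retrieve, build_prompt), and the bag-of-words cosine similarity used for retrieval are all hypothetical, and the VLM and LM calls are replaced by a hard-coded description and a printed prompt.

```python
import math
import re
from collections import Counter

# Hypothetical text database: one attribute description per vehicle model.
# Supporting a new model means adding one more entry; nothing is retrained.
VEHICLE_DB = {
    "Toyota Corolla": "compact sedan, slim horizontal headlights, narrow upper grille, wide lower air intake",
    "Ford F-150": "full-size pickup truck, tall rectangular grille, c-shaped headlights, open cargo bed",
    "Tesla Model 3": "electric sedan, no front grille, smooth nose, flush door handles, glass roof",
}


def tokenize(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(description, db, k=2):
    """Return the names of the k entries most similar to the description."""
    query = tokenize(description)
    scored = [(cosine(query, tokenize(text)), name) for name, text in db.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def build_prompt(description, candidates, db):
    """Combine the image description with the retrieved entries into an LM prompt."""
    lines = [f"- {name}: {db[name]}" for name in candidates]
    return (
        "Observed vehicle attributes:\n" + description + "\n\n"
        "Candidate models from the database:\n" + "\n".join(lines) + "\n\n"
        "Which make and model best matches the observed attributes?"
    )


# Stand-in for the VLM output; a real pipeline would generate this from an image.
description = "electric sedan with smooth nose, no front grille and a glass roof"
candidates = retrieve(description, VEHICLE_DB)
prompt = build_prompt(description, candidates, VEHICLE_DB)
print(prompt)  # a real pipeline would send this prompt to the LM
```

In this sketch, adding a newly released vehicle amounts to inserting one more key and description into VEHICLE_DB, which mirrors the text-only update path the abstract describes. A production system would more likely use dense text embeddings for retrieval; the token-overlap similarity is used here only to keep the example dependency-free.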
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
