A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Qi You, Yitai Cheng, Zichao Zeng, James Haworth

TL;DR
This paper introduces CLIP-MHAdapter, a lightweight adaptation method for street-view image attribute classification that leverages attention-based feature modeling to improve accuracy while maintaining low computational costs.
Contribution
Proposes a lightweight CLIP adaptation framework that applies multi-head self-attention to patch tokens, improving fine-grained attribute classification in street-view images.
Findings
Achieves state-of-the-art accuracy on the Global StreetScapes dataset.
Maintains low computational cost with approximately 1.4 million trainable parameters.
Outperforms existing methods across eight attribute classification tasks.
Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million…
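The adapter described in the abstract (a bottleneck MLP with multi-head self-attention over patch tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The class name `PatchAttentionAdapter`, the dimensions (`dim=768`, `bottleneck=64`, `heads=4`, `num_classes=5`), the residual connections, the mean pooling, the linear classifier head, and the omission of biases and layer normalisation are all assumptions made for illustration. The placeholder sizes are not chosen to reproduce the reported parameter count. The input stands in for patch tokens from a frozen CLIP image encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class PatchAttentionAdapter:
    """Hypothetical bottleneck adapter with self-attention over patch tokens."""

    def __init__(self, dim=768, bottleneck=64, heads=4, num_classes=5, seed=0):
        assert bottleneck % heads == 0, "bottleneck must divide evenly across heads"
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.02, size=shape)

        self.heads = heads
        self.w_down = init(dim, bottleneck)            # project into the bottleneck
        self.w_qkv = init(bottleneck, 3 * bottleneck)  # joint query/key/value projection
        self.w_out = init(bottleneck, bottleneck)      # attention output projection
        self.w_up = init(bottleneck, dim)              # project back to token width
        self.w_cls = init(dim, num_classes)            # assumed linear classifier head

    def num_params(self):
        # Count only the adapter's own (trainable) weights.
        weights = (self.w_down, self.w_qkv, self.w_out, self.w_up, self.w_cls)
        return sum(w.size for w in weights)

    def __call__(self, tokens):
        # tokens: (batch, num_patches, dim), assumed to come from a frozen encoder.
        b, n, _ = tokens.shape
        h = np.maximum(tokens @ self.w_down, 0.0)      # down-projection + ReLU

        q, k, v = np.split(h @ self.w_qkv, 3, axis=-1)

        def split_heads(x):
            # (b, n, r) -> (b, heads, n, r / heads)
            return x.reshape(b, n, self.heads, -1).transpose(0, 2, 1, 3)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scale = np.sqrt(q.shape[-1])
        attn = softmax(q @ k.transpose(0, 1, 3, 2) / scale)      # (b, heads, n, n)
        mixed = (attn @ v).transpose(0, 2, 1, 3).reshape(b, n, -1)

        h = h + mixed @ self.w_out                     # residual inside the bottleneck
        adapted = tokens + h @ self.w_up               # residual around the adapter
        pooled = adapted.mean(axis=1)                  # average over patches
        return pooled @ self.w_cls                     # (batch, num_classes) logits


if __name__ == "__main__":
    adapter = PatchAttentionAdapter()
    fake_tokens = np.random.default_rng(1).normal(size=(2, 49, 768))
    print(adapter(fake_tokens).shape, adapter.num_params())
```

The attention matrix has shape (patches × patches) per head, which is where inter-patch dependencies are modelled. A method that uses only the global image embedding has no such term.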
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
