A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You; Yitai Cheng; Zichao Zeng; James Haworth

arXiv:2602.16590 · cs.CV · February 19, 2026

Open Access

TL;DR

This paper introduces CLIP-MHAdapter, a lightweight adaptation method for street-view image attribute classification that leverages attention-based feature modeling to improve accuracy while maintaining low computational costs.

Contribution

It proposes a novel lightweight adaptation framework using attention on patch tokens to enhance fine-grained attribute classification in street-view images.

Findings

01

Achieves state-of-the-art accuracy on the Global StreetScapes dataset.

02

Keeps computational cost low, with approximately 1.4 million trainable parameters.

03

Outperforms existing methods across eight attribute classification tasks.

Abstract

Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million…
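To make the adapter idea in the abstract more concrete, the following is a minimal NumPy sketch of a bottleneck adapter with multi-head self-attention over patch tokens. It is not the authors' implementation. The class name `PatchAttentionAdapter`, the layer ordering (down-projection, attention, up-projection), the residual connection, the mean pooling, and all dimensions are assumptions chosen for illustration. The token shape (196 patches of width 768) corresponds to a ViT-B/16 image encoder at 224×224 resolution, which is also an assumption here; the sketch does not attempt to reproduce the paper's parameter count.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class PatchAttentionAdapter:
    """Hypothetical adapter: down-project -> multi-head self-attention -> up-project.

    The structure and sizes are illustrative assumptions, not the paper's design.
    """

    def __init__(self, d_model, d_bottleneck, n_heads, seed=0):
        assert d_bottleneck % n_heads == 0, "bottleneck width must divide evenly across heads"
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.02, size=shape)

        self.n_heads = n_heads
        self.params = {
            "w_down": init(d_model, d_bottleneck), "b_down": np.zeros(d_bottleneck),
            "w_qkv": init(d_bottleneck, 3 * d_bottleneck), "b_qkv": np.zeros(3 * d_bottleneck),
            "w_out": init(d_bottleneck, d_bottleneck), "b_out": np.zeros(d_bottleneck),
            "w_up": init(d_bottleneck, d_model), "b_up": np.zeros(d_model),
        }

    def num_parameters(self):
        # Total number of scalar weights and biases in the adapter.
        return sum(p.size for p in self.params.values())

    def __call__(self, tokens):
        # tokens: (n_patches, d_model) patch embeddings from a frozen image encoder.
        p = self.params
        n = tokens.shape[0]

        # Bottleneck: project each patch token to a narrower width, then ReLU.
        h = np.maximum(tokens @ p["w_down"] + p["b_down"], 0.0)

        # Multi-head self-attention across patches, computed inside the bottleneck.
        q, k, v = np.split(h @ p["w_qkv"] + p["b_qkv"], 3, axis=-1)

        def split_heads(x):
            # (n, r) -> (heads, n, r / heads)
            return x.reshape(n, self.n_heads, -1).transpose(1, 0, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
        attn = softmax(scores, axis=-1)  # (heads, n, n): each patch attends to all patches
        mixed = (attn @ v).transpose(1, 0, 2).reshape(n, -1)

        # Residual connection inside the bottleneck (an assumption), then up-projection.
        h = h + mixed @ p["w_out"] + p["b_out"]
        adapted = h @ p["w_up"] + p["b_up"]

        # Mean-pool the adapted patch tokens into a single image-level feature.
        return adapted.mean(axis=0), attn


adapter = PatchAttentionAdapter(d_model=768, d_bottleneck=256, n_heads=4)
patch_tokens = np.random.default_rng(1).normal(size=(196, 768))  # stand-in for encoder output
feature, attn = adapter(patch_tokens)
print(feature.shape, attn.shape, adapter.num_parameters())
```

In a CLIP-style pipeline the pooled `feature` would then be compared against text embeddings of the class prompts, with only the adapter weights trained and the backbone kept frozen. Placing attention inside the bottleneck keeps the attention weights proportional to the square of the bottleneck width rather than the full token width, which is one plausible way such an adapter stays small.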

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications