Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, and Gabriele Facciolo

arXiv:2603.24181·cs.CV·March 26, 2026

Open Access

TL;DR

This paper demonstrates that LVLMs can be enhanced for few-shot image classification through prompt conditioning and head selection, achieving state-of-the-art results without additional training.

Contribution

The paper introduces Head Ensemble Classifiers (HEC), a training-free method that leverages attention head importance to improve LVLM classification performance.
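To make the idea of a training-free head ensemble concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this page does not specify how HEC scores or combines heads, so the sketch assumes one plausible recipe (per-head nearest-prototype classification with cosine similarity, leave-one-out support accuracy as a stand-in for head importance, and summed similarities across the top-k heads). All function names and the toy features are hypothetical; in practice the per-head features would be extracted from an LVLM's attention heads.

```python
import math
from collections import defaultdict

# Illustrative sketch only: the scoring rule and ensembling rule below are
# assumptions, not the method described in the paper.

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / n for x in v]

def _cosine(a, b):
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))

def class_prototypes(feats, labels):
    """Mean feature vector per class for a single head."""
    groups = defaultdict(list)
    for f, y in zip(feats, labels):
        groups[y].append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)] for y, fs in groups.items()}

def head_scores(query, protos):
    """Cosine similarity of one query feature to each class prototype."""
    return {y: _cosine(query, p) for y, p in protos.items()}

def head_accuracy(feats, labels):
    """Leave-one-out nearest-prototype accuracy on the support set.

    Used here as an assumed proxy for head importance; it needs at least
    two shots per class so the held-out class still has a prototype.
    """
    correct = 0
    for i, (f, y) in enumerate(zip(feats, labels)):
        protos = class_prototypes(feats[:i] + feats[i + 1:], labels[:i] + labels[i + 1:])
        scores = head_scores(f, protos)
        correct += max(scores, key=scores.get) == y
    return correct / len(labels)

def fit_head_ensemble(support, labels, k):
    """support maps head_id -> list of feature vectors aligned with labels."""
    ranked = sorted(support, key=lambda h: head_accuracy(support[h], labels), reverse=True)
    return {h: class_prototypes(support[h], labels) for h in ranked[:k]}

def predict(ensemble, query):
    """query maps head_id -> feature vector; sums similarities over selected heads."""
    total = defaultdict(float)
    for h, protos in ensemble.items():
        for y, s in head_scores(query[h], protos).items():
            total[y] += s
    return max(total, key=total.get)

# Toy data (made up): one head separates the classes, the other does not.
labels = ["cat", "cat", "dog", "dog"]
support = {
    "informative": [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]],
    "noisy": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]],
}
ensemble = fit_head_ensemble(support, labels, k=1)
prediction = predict(ensemble, {"informative": [0.8, 0.2], "noisy": [0.0, 1.0]})
```

No weights are updated anywhere in this sketch: the support set is used only to build prototypes and to rank heads, which is what makes an approach of this kind training-free.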

Findings

01

LVLMs' internal representations, especially attention heads, can outperform the model's own outputs at zero-shot and few-shot classification.

02

Prompt conditioning improves the class separability of visual features at inference.

03

HEC achieves state-of-the-art few-shot and zero-shot classification results.

Abstract

Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks such as image captioning, visual question answering, and OCR. However, these same models perform poorly at image classification, underperforming CLIP-based methods. This gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture, which uses independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper, we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and that LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques