Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Chancharik Mitra; Brandon Huang; Tianning Chai; Zhiqiu Lin; Assaf Arbelle; Rogerio Feris; Leonid Karlinsky; Trevor Darrell; Deva Ramanan; Roei Herzig

arXiv:2412.00142 · cs.CV · June 10, 2025

TL;DR

This paper introduces Sparse Attention Vectors (SAVs), a finetuning-free method that extracts strong multimodal features from the latent space of large multimodal models (LMMs), significantly improving few-shot vision-language classification performance.

Contribution

The paper proposes SAVs, a novel approach that uses a sparse set of attention head activations as features, enabling effective few-shot classification without finetuning the large multimodal model.

Findings

01. SAVs achieve state-of-the-art results on few-shot vision-language classification tasks.

02. SAVs improve as more examples are added and generalize across similar tasks.

03. The method is finetuning-free and scalable, and yields robust feature representations.

Abstract

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks (i.e., tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM's latent space. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 5% of the heads) in LMMs as strong feature representations. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and…
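
The abstract describes SAVs only at a high level: a small subset of attention heads (fewer than 5%) serves as the feature representation for few-shot classification. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes the per-head activation vectors have already been extracted from an LMM (synthetic NumPy arrays stand in for them here), and it assumes that heads are ranked by nearest-class-mean accuracy on the few-shot examples and combined by majority vote; neither detail is stated in the text above. All function names are hypothetical.

```python
import numpy as np

def class_means(feats, labels, n_classes):
    # feats: (n_examples, n_heads, head_dim) -> (n_classes, n_heads, head_dim)
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])

def per_head_predictions(feats, means):
    # Cosine similarity between each example's head vector and each class mean
    # for the same head; every head predicts the most similar class.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    m = means / (np.linalg.norm(means, axis=-1, keepdims=True) + 1e-8)
    sims = np.einsum("nhd,chd->nhc", f, m)
    return sims.argmax(axis=-1)  # (n_examples, n_heads)

def select_heads(feats, labels, means, k):
    # ASSUMPTION: rank heads by their accuracy on the few-shot examples, keep top k.
    acc = (per_head_predictions(feats, means) == labels[:, None]).mean(axis=0)
    return np.argsort(-acc, kind="stable")[:k]

def classify(query_feats, means, heads, n_classes):
    # ASSUMPTION: the selected heads are combined by simple majority vote.
    preds = per_head_predictions(query_feats[:, heads], means[:, heads])
    votes = np.stack([(preds == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return votes.argmax(axis=1)

def make_data(rng, n_per_class, informative, n_classes=3, n_heads=200, dim=8):
    # Synthetic stand-in for extracted activations: only the `informative`
    # heads carry a class-dependent signal, all other heads are pure noise.
    labels = np.repeat(np.arange(n_classes), n_per_class)
    feats = rng.normal(size=(labels.size, n_heads, dim))
    signal = 8.0 * np.eye(n_classes, dim)
    feats[:, informative, :] += signal[labels][:, None, :]
    return feats, labels

rng = np.random.default_rng(0)
n_classes, n_heads, k = 3, 200, 8            # 8 of 200 heads = 4% of heads
informative = rng.choice(n_heads, size=k, replace=False)

support, support_labels = make_data(rng, 20, informative)
queries, query_labels = make_data(rng, 30, informative)

means = class_means(support, support_labels, n_classes)
heads = select_heads(support, support_labels, means, k)
predictions = classify(queries, means, heads, n_classes)
accuracy = float((predictions == query_labels).mean())

print("selected heads:", sorted(heads.tolist()))
print("query accuracy:", accuracy)
```

In this toy setup the base model is never updated; the only learned quantities are the per-class mean vectors and the indices of the selected heads, which is what makes such an approach finetuning-free.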

Taxonomy

Topics: Language, Metaphor, and Cognition · Natural Language Processing Techniques · linguistics and terminology studies

Methods: Softmax · Attention Is All You Need