IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

Liang Shi; Wei Li; Kevin M Beussman; Lin Chen; Yun Fu

arXiv:2601.14188 · cs.CV · January 21, 2026

Open Access

TL;DR

This paper introduces IIR-VLM, a vision-language model (VLM) that improves instance-level recognition (ILR) by integrating pre-trained ILR expert models as auxiliary visual encoders, enabling one-shot recognition of new instances across diverse categories.

Contribution

The paper proposes IIR-VLM, which integrates pre-trained instance-level recognition (ILR) expert models into VLMs as auxiliary visual encoders for in-context recognition, addressing the weakness of existing VLMs at fine-grained, instance-level discrimination.

Findings

01. Outperforms existing ILR models on personalization benchmarks.

02. Effective one-shot learning of new instances in diverse categories.

03. Demonstrates superior performance on a new challenging benchmark.

Abstract

Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., those where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incurs substantial data collection and training costs but also struggles with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide…

Taxonomy

Topics: Multimodal Machine Learning Applications · Face recognition and analysis · Domain Adaptation and Few-Shot Learning