IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models
Liang Shi, Wei Li, Kevin M Beussman, Lin Chen, Yun Fu

TL;DR
This paper introduces IIR-VLM, a vision-language model (VLM) that improves instance-level recognition (ILR) by integrating pre-trained ILR expert models as auxiliary visual encoders, enabling one-shot, in-context recognition of new instances across diverse categories.
Contribution
The paper proposes IIR-VLM, which incorporates instance-level recognition (ILR) expert models into VLMs for in-context learning, addressing the weakness of existing VLMs on fine-grained, instance-level recognition.
Findings
Outperforms existing ILR models on personalization benchmarks.
Effective one-shot learning of new instances in diverse categories.
Demonstrates superior performance on a new challenging benchmark.
Abstract
Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide…
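To make the ILR task in the abstract concrete, here is a minimal sketch of the classical embedding-matching setup commonly used by domain-specific recognizers such as person re-identification models: each known instance has one reference embedding (a one-shot gallery), and a query is assigned to the most similar reference. This is not IIR-VLM's method, which the abstract only partially describes. The names `cosine_similarity`, `identify`, `gallery`, the threshold value, and the toy 3-dimensional vectors are all assumptions made for illustration; real embeddings would come from a trained encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query, gallery, threshold=0.8):
    """Return (name, score) for the closest gallery instance.

    `gallery` maps an instance name to a single reference embedding
    (one-shot). If the best score falls below `threshold` (an arbitrary
    illustrative value), the query is treated as an unknown instance
    and the returned name is None.
    """
    best_name, best_score = None, -1.0
    for name, reference in gallery.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy embeddings standing in for encoder outputs (hypothetical values).
gallery = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.9, 0.2],
}

match, score = identify([0.8, 0.2, 0.1], gallery)
unknown, _ = identify([0.0, 0.0, 1.0], gallery)
print(match, unknown)
```

The threshold separates "which known instance is this?" from "is this a known instance at all?", a distinction that matters for fine-grained discrimination between visually similar instances.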
Taxonomy
Topics: Multimodal Machine Learning Applications · Face recognition and analysis · Domain Adaptation and Few-Shot Learning
