Online In-Context Distillation for Low-Resource Vision Language Models

Zhiqi Kang; Rahaf Aljundi; Vaggelis Dorovatas; Karteek Alahari

arXiv:2510.18117 · cs.CV · April 8, 2026

TL;DR

This paper introduces an online in-context distillation method in which a small vision-language model draws on a larger teacher model at inference time, improving its performance by up to 33% in low-resource settings.

Contribution

The paper proposes an online in-context distillation approach that combines demonstration selection with test-time scaling, enabling small models to reach performance competitive with the teacher's zero-shot results without fine-tuning.
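To make the idea of online distillation through demonstrations concrete, here is a minimal sketch under stated assumptions. It is not the authors' algorithm: the confidence threshold, the cosine-similarity selection, the annotation budget, and every name in it (`DemoPool`, `online_icd`, `toy_student`, `toy_teacher`) are hypothetical choices made for illustration. The student answers each incoming query with the most similar teacher-annotated demonstrations in its context, and the teacher is queried only when the student is unsure and budget remains.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Demo:
    embedding: Vector  # assumed: some feature vector for the input
    query: str
    answer: str        # annotation provided by the teacher


@dataclass
class DemoPool:
    demos: List[Demo] = field(default_factory=list)

    def select(self, embedding: Vector, k: int) -> List[Demo]:
        """Return the k stored demonstrations most similar to the query."""
        ranked = sorted(
            self.demos,
            key=lambda d: cosine(d.embedding, embedding),
            reverse=True,
        )
        return ranked[:k]


Student = Callable[[str, List[Demo]], Tuple[str, float]]
Teacher = Callable[[str], str]


def online_icd(
    stream: Iterable[Tuple[Vector, str]],
    student: Student,
    teacher: Teacher,
    k: int = 2,
    threshold: float = 0.5,  # hypothetical confidence cut-off
    budget: int = 10,        # hypothetical cap on teacher calls
) -> Tuple[List[str], int]:
    """Process queries one by one, growing the demo pool sparsely."""
    pool = DemoPool()
    outputs: List[str] = []
    used = 0
    for embedding, query in stream:
        demos = pool.select(embedding, k)
        answer, confidence = student(query, demos)
        if confidence < threshold and used < budget:
            # Assumption: the teacher's answer is both returned and stored.
            answer = teacher(query)
            pool.demos.append(Demo(embedding, query, answer))
            used += 1
        outputs.append(answer)
    return outputs, used


def toy_teacher(query: str) -> str:
    """Stand-in for a large VLM: 'labels' a query by upper-casing it."""
    return query.upper()


def toy_student(query: str, demos: List[Demo]) -> Tuple[str, float]:
    """Stand-in for a small VLM: confident only if a demo matches exactly."""
    for demo in demos:
        if demo.query == query:
            return demo.answer, 1.0
    return "?", 0.0
```

A toy run would look like `online_icd([([1.0, 0.0], "cat"), ([0.0, 1.0], "dog"), ([1.0, 0.0], "cat")], toy_student, toy_teacher)`: the teacher is called for the first two queries, and the third is answered by the student from the stored demonstration. In a real system the student and teacher would be actual models and the embeddings would come from an image or multimodal encoder; the budget parameter is what keeps the teacher annotations sparse.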

Findings

01. Small models improve by up to 33% with minimal teacher annotations.

02. In-Context Distillation (ICD) outperforms fine-tuning under limited compute budgets.

03. The method achieves performance competitive with the teacher's zero-shot results.

Abstract

As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible,…
