VIOLA: Towards Video In-Context Learning with Minimal Annotations

Ryo Fujii; Hideo Saito; Ryo Hachiuma

arXiv:2601.15549 · cs.CV · January 23, 2026

TL;DR

VIOLA introduces a label-efficient video in-context learning framework that combines minimal expert annotations with unlabeled data, employing density-based sampling and confidence-aware retrieval to enhance multimodal model adaptation.

Contribution

The paper presents a novel hybrid approach that effectively utilizes minimal annotations and unlabeled data for video domain adaptation in multimodal large language models.

Findings

01. Significantly outperforms baselines in low-resource settings

02. Achieves robust adaptation with minimal annotation costs

03. Demonstrates effectiveness across nine diverse benchmarks

Abstract

Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments such as industrial or surgical settings because they require expert annotation. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that…
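
To make the idea of density-uncertainty-weighted sampling concrete, the following is a minimal sketch and not the authors' implementation. The abstract is truncated here and does not specify the estimator or the weighting, so every detail below is an assumption for illustration: the k-nearest-neighbour density proxy, the predictive-entropy uncertainty, the multiplicative combination, and the hypothetical names `knn_density`, `predictive_entropy`, and `select_for_annotation`.

```python
import numpy as np


def knn_density(features: np.ndarray, k: int = 5) -> np.ndarray:
    """Density proxy (assumed): inverse mean distance to the k nearest neighbours."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the self-distance (0), so skip it.
    nearest = np.sort(dists, axis=1)[:, 1:k + 1]
    return 1.0 / (nearest.mean(axis=1) + 1e-8)


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Uncertainty proxy (assumed): entropy of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def _min_max(x: np.ndarray) -> np.ndarray:
    """Scale to [0, 1]; a constant vector maps to all ones."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.ones_like(x)


def select_for_annotation(
    features: np.ndarray, probs: np.ndarray, budget: int, k: int = 5
) -> np.ndarray:
    """Return indices of the `budget` samples with the highest combined score.

    The score multiplies normalised density by normalised uncertainty
    (an assumed weighting), so an uncertain but isolated sample scores low.
    """
    score = _min_max(knn_density(features, k)) * _min_max(predictive_entropy(probs))
    return np.argsort(-score)[:budget]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Ten clustered samples plus one visual outlier at index 10.
    feats = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)), [[100.0, 100.0]]])
    # Every sample is maximally uncertain, so density decides the ranking.
    probs = np.full((11, 4), 0.25)
    print(select_for_annotation(feats, probs, budget=3, k=3))
```

In this toy setup, pure uncertainty sampling could not distinguish the outlier from the clustered samples, whereas the density term pushes the isolated point to the bottom of the ranking. That is the failure mode the abstract attributes to standard diversity or uncertainty strategies.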

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis