VIOLA: Towards Video In-Context Learning with Minimal Annotations
Ryo Fujii, Hideo Saito, Ryo Hachiuma

TL;DR
VIOLA introduces a label-efficient video in-context learning framework that combines minimal expert annotations with unlabeled data, employing density-based sampling and confidence-aware retrieval to enhance multimodal model adaptation.
Contribution
The paper presents a novel hybrid approach that effectively utilizes minimal annotations and unlabeled data for video domain adaptation in multimodal large language models.
Findings
Significantly outperforms baselines in low-resource settings
Achieves robust adaptation with minimal annotation costs
Demonstrates effectiveness across nine diverse benchmarks
Abstract
Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments such as industrial or surgical settings because they require expert annotation. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that…
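The abstract is cut off before it specifies the sampling rule, so the following is only an illustrative sketch of the general idea of weighting uncertainty by local density, not the authors' formulation. It assumes each unlabeled video is already represented by a feature vector and a predicted class distribution. The k-nearest-neighbour density proxy, the entropy-based uncertainty, the min-max scaling, the product score, and all function names (`knn_density`, `predictive_entropy`, `select_for_annotation`) are assumptions made for this example.

```python
import numpy as np


def knn_density(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Density proxy (assumed): inverse mean distance to the k nearest neighbours."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # After sorting each row, column 0 is the self-distance (0), so skip it.
    nearest = np.sort(dists, axis=1)[:, 1:k + 1]
    return 1.0 / (nearest.mean(axis=1) + 1e-8)


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Uncertainty proxy (assumed): Shannon entropy of the predicted distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def _minmax(x: np.ndarray) -> np.ndarray:
    """Scale to [0, 1]; a constant vector maps to all ones."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.ones_like(x)


def select_for_annotation(features, probs, budget, k=3):
    """Return indices of the `budget` samples with the highest density * uncertainty."""
    score = _minmax(knn_density(features, k)) * _minmax(predictive_entropy(probs))
    return np.argsort(-score)[:budget]


# Hypothetical toy data: five clustered samples and one far-away outlier.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [0.1, 0.1], [0.05, 0.05], [10.0, 10.0]])
probs = np.full((6, 2), 0.5)  # every sample is equally uncertain
picked = select_for_annotation(feats, probs, budget=2, k=3)
```

In this toy setup every sample is equally uncertain, so the density term alone decides the ranking, and the isolated point at (10, 10) gets the lowest score. A purely uncertainty-driven selector would have no basis for excluding it.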
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
