Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Yuhao Dong; Shulin Tian; Shuai Liu; Shuangrui Ding; Yuhang Zang; Xiaoyi Dong; Yuhang Cao; Jiaqi Wang; Ziwei Liu

arXiv:2602.08439·cs.CV·February 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Demo-ICL, a video in-context learning task in which a model learns from text or video demonstrations to answer questions about a target video, along with Demo-ICL-Bench, a benchmark built from 1200 instructional YouTube videos.

Contribution

The paper proposes Demo-ICL, a novel task and benchmark for video in-context learning, and develops a specialized model with a two-stage training strategy to improve learning from demonstrations.

Findings

01. Demo-ICL-Bench is challenging for current models.

02. Demo-ICL improves models' ability to learn from demonstrations.

03. State-of-the-art models show significant room for improvement.

Abstract

Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well-written and easy to follow.
2. The motivation is strong: exploring In-Context Learning for video understanding is both important and underexplored, addressing a genuine gap in current MLLM research.
3. The work is comprehensive, presenting not only a new task formulation but also a dedicated benchmark and a well-designed method to address it.
4. The experimental evaluation is thorough, including results on both the proposed Demo-ICL-Bench and general benchmarks, and providi

Weaknesses

1. The questions in the benchmark appear overly detailed, which may reduce the contribution of the target video itself. I'm very curious about the performance of the Text-demo In-Context Learning subtask without the target video (i.e., question-only accuracy).
2. The paper lacks detailed explanation and examples for the Demonstration Selection subtask. It is unclear whether each video is accompanied by additional information, such as subtitles or ASR transcripts.
3. Beyond providing a new evalua
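
The question-only check requested in point 1 can be sketched as a simple accuracy comparison. This is a minimal illustration, not the paper's evaluation code: it assumes answers are scored as exact label matches, and the function names `accuracy` and `video_contribution`, as well as the toy answer lists, are hypothetical placeholders rather than benchmark data.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def video_contribution(with_video: list[str],
                       question_only: list[str],
                       answers: list[str]) -> float:
    """Accuracy gained by showing the target video, relative to question-only input."""
    return accuracy(with_video, answers) - accuracy(question_only, answers)


# Toy placeholder labels (not results from the paper).
answers = ["B", "A", "D", "C"]
with_video = ["B", "A", "D", "A"]      # model sees demonstration + target video
question_only = ["B", "C", "D", "A"]   # model sees demonstration + question only

gap = video_contribution(with_video, question_only, answers)
print(f"accuracy gap attributable to the target video: {gap:.2f}")
```

A gap near zero on real data would support the reviewer's concern that the questions alone carry most of the information needed to answer.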

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper invents a new learning paradigm, demo-driven video in-context learning, by reframing video understanding as “watch a tutorial, then perform the next step in a different video”. This moves the community goalpost from static recognition to rapid procedural knowledge uptake, an angle no prior benchmark or method has systematically targeted.
2. Demo-ICL-Bench is built with industrial-grade rigor: 1,200 YouTube instructional videos filtered for language, length, and narrated step boundarie

Weaknesses

1. In the third task the model must first retrieve the correct tutorial from a 200-video pool and then answer. The paper only reports top-1 selection accuracy (S-Acc) and final QA score, but never analyzes why retrieval fails.
2. The evaluation metric is blind to step granularity. Accuracy is binary on the next step; it gives no partial credit when the model predicts “add onion” instead of “add onion and garlic.” A softer metric such as human preference win-rate would reward models that learn coarse
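
The granularity concern in point 2 can be illustrated by comparing a binary score with a softer one on the reviewer's own example. The sketch below swaps in token-level F1 as the soft metric; that is an assumption made for illustration, not the human preference win-rate the reviewer suggests and not a metric from the paper. It also assumes free-form step predictions, and the function names are hypothetical.

```python
from collections import Counter


def tokenize(step: str) -> list[str]:
    """Lowercase and split on whitespace (punctuation is left attached to words)."""
    return step.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """Binary score: 1.0 only if the predicted step equals the reference step."""
    return 1.0 if tokenize(pred) == tokenize(gold) else 0.0


def token_f1(pred: str, gold: str) -> float:
    """Soft score: harmonic mean of token precision and recall."""
    pred_tokens, gold_tokens = tokenize(pred), tokenize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


pred, gold = "add onion", "add onion and garlic"
print(exact_match(pred, gold))          # binary score gives no credit
print(round(token_f1(pred, gold), 2))   # soft score gives partial credit
```

For this pair, exact match returns 0.0 while token F1 returns roughly 0.67 (precision 1.0, recall 0.5), so the coarse but partly correct prediction is distinguished from an unrelated one.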

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The task of demo-driven video in-context learning is conceptually interesting and relevant for next-generation multimodal reasoning, particularly for applications like robotics or educational AI systems.
2. The benchmark covers three distinct tasks: text-demo, video-demo, and demonstration selection.

Weaknesses

1. The claims are overstated, particularly the assertion that Demo-ICL “unveils future research directions” and “demonstrates superior knowledge acquisition.” The performance margins are relatively small and inconsistent across subtasks.
2. Several sentences and sections are unclear (e.g., descriptions of how demonstrations are chosen and how textual vs. video modalities interact). It’s often difficult to follow the flow between task setup, dataset construction, and evaluation.
3. The abstract

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling