Predicting Implicit Arguments in Procedural Video Instructions
Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

TL;DR
This paper introduces Implicit-VidSRL, a new dataset for evaluating multimodal models' ability to infer implicit arguments in procedural videos, revealing current models' limitations and proposing an improved model with better inference capabilities.
Contribution
The paper presents Implicit-VidSRL, a novel dataset for implicit argument prediction in multimodal procedural videos, and proposes iSRL-Qwen2-VL, a model that improves inference accuracy over GPT-4o.
Findings
Multimodal models struggle to predict implicit arguments in procedural videos.
The Implicit-VidSRL dataset benchmarks contextual reasoning in multimodal instructions.
The proposed iSRL-Qwen2-VL outperforms GPT-4o in implicit argument prediction.
Abstract
Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like {verb, what, where/with}. Procedural instructions are highly elliptic. For instance, given (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step's where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent…
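The cucumber/tomato example in the abstract can be made concrete with a small sketch. The `Frame` class and the `fill_implicit_where` helper below are hypothetical names introduced only for illustration. The carry-forward rule is a deliberately naive baseline, not the paper's method or the dataset's annotation scheme. It assumes that a missing where/with slot refers to the most recent explicitly mentioned location.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    """One instruction step as a {verb, what, where/with} frame (illustrative only)."""
    verb: str
    what: Optional[str] = None
    where: Optional[str] = None  # the where/with slot; None means it was left implicit


def fill_implicit_where(frames: List[Frame]) -> List[Frame]:
    """Naive baseline: copy the most recent explicit `where` into later frames that omit it."""
    last_where: Optional[str] = None
    resolved: List[Frame] = []
    for frame in frames:
        if frame.where is not None:
            # An explicit location updates the running context.
            last_where = frame.where
        elif last_where is not None:
            # An implicit location is filled from the running context.
            frame = replace(frame, where=last_where)
        resolved.append(frame)
    return resolved


steps = [
    Frame("add", what="cucumber", where="bowl"),  # (i) add cucumber to the bowl
    Frame("add", what="sliced tomatoes"),         # (ii) add sliced tomatoes
]
resolved = fill_implicit_where(steps)
print(resolved[1])
```

A rule like this ignores the visual changes and entity tracking that the abstract says the dataset requires, which is why it is best read as an illustration of the task format and not as a solution to it.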
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Software Engineering Research · Subtitles and Audiovisual Media
