Predicting Implicit Arguments in Procedural Video Instructions

Anil Batra; Laura Sevilla-Lara; Marcus Rohrbach; Frank Keller

arXiv:2505.21068·cs.CL·May 28, 2025

TL;DR

This paper introduces Implicit-VidSRL, a dataset for evaluating how well multimodal models infer implicit arguments in procedural videos. It shows that current models struggle with this task and proposes iSRL-Qwen2-VL, a model that outperforms GPT-4o at implicit argument prediction.

Contribution

The paper presents Implicit-VidSRL, a novel dataset for implicit argument prediction in multimodal procedural videos, and proposes iSRL-Qwen2-VL, a model that improves inference accuracy over GPT-4o.

Findings

01. Multimodal models struggle to predict implicit arguments in procedural videos.

02. The Implicit-VidSRL dataset benchmarks contextual reasoning in multimodal instructions.

03. The proposed iSRL-Qwen2-VL outperforms GPT-4o in implicit argument prediction.

Abstract

Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like {verb, what, where/with}. Procedural instructions are highly elliptic. For instance, given (i) "add cucumber to the bowl" and (ii) "add sliced tomatoes", the second step's where argument must be inferred from the context: it refers to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent…
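To make the cucumber and tomato example concrete, here is a minimal Python sketch of the {verb, what, where} structure and of what it means to fill in an implicit where argument. The `Frame` class and the `fill_implicit_where` helper are hypothetical names introduced only for illustration. The "copy the most recent explicit location" rule is a deliberately naive text-only heuristic, not the paper's method, which relies on multimodal models and entity tracking.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    """One instruction step as a predicate-argument frame (hypothetical schema)."""
    verb: str
    what: Optional[str] = None
    where: Optional[str] = None  # None means the argument is implicit in the text


def fill_implicit_where(steps: List[Frame]) -> List[Frame]:
    """Naive baseline: copy the most recent explicit `where` into steps that omit it."""
    resolved = []
    last_where = None
    for step in steps:
        if step.where is not None:
            # Explicit location: remember it for later steps.
            last_where = step.where
        elif last_where is not None:
            # Implicit location: reuse the last one mentioned.
            step = replace(step, where=last_where)
        resolved.append(step)
    return resolved


steps = [
    Frame(verb="add", what="cucumber", where="bowl"),  # (i) add cucumber to the bowl
    Frame(verb="add", what="sliced tomatoes"),         # (ii) add sliced tomatoes
]
resolved = fill_implicit_where(steps)
print(resolved[1])
# Frame(verb='add', what='sliced tomatoes', where='bowl')
```

A heuristic like this breaks as soon as ingredients move between containers or are transformed (for example, a mixture poured from a bowl into a pan), which is the kind of contextual reasoning the abstract says the dataset is meant to test.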

Code & Models

Datasets

anilbatra/Implicit-VidSRL
dataset · 30 dl

Videos

Predicting Implicit Arguments in Procedural Video Instructions · underline

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Software Engineering Research · Subtitles and Audiovisual Media