Temporally Grounding Instructional Diagrams in Unconstrained Videos
Jiahao Zhang, Frederic Z. Zhang, Cristian Rodriguez, Yizhak Ben-Shabat, Anoop Cherian, Stephen Gould

TL;DR
This paper introduces a method for simultaneously localizing sequences of instructional diagrams in videos by leveraging composite queries and attention mechanisms, improving accuracy over existing single-query grounding methods.
Contribution
It proposes a novel approach using composite queries and self- and cross-attention to jointly ground multiple instructional diagrams in videos, capturing their interrelationships.
Findings
Significantly outperforms existing methods on IAW and YouCook2 datasets.
Effectively reduces timespan overlaps and maintains temporal order.
Demonstrates the benefit of joint grounding over single-query methods.
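To make the overlap and temporal-order findings concrete, here is a minimal sketch of how the two failure modes could be quantified for a list of predicted timespans. The helper names (temporal_iou, mean_pairwise_overlap, count_order_violations) and the toy timespans are illustrative assumptions, not the paper's evaluation protocol or its reported numbers.

```python
from itertools import combinations

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) timespans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_pairwise_overlap(spans):
    """Average temporal IoU over all pairs of predicted timespans."""
    pairs = list(combinations(spans, 2))
    if not pairs:
        return 0.0
    return sum(temporal_iou(a, b) for a, b in pairs) / len(pairs)

def count_order_violations(spans):
    """Count pairs (i < j) where step j is predicted to start before step i.

    `spans` is assumed to be listed in step-diagram order.
    """
    return sum(
        1
        for i, j in combinations(range(len(spans)), 2)
        if spans[j][0] < spans[i][0]
    )

# Hypothetical predictions for three step diagrams (toy values only).
independent_spans = [(0.0, 10.0), (5.0, 15.0), (3.0, 8.0)]  # one query at a time
joint_spans = [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0)]       # grounded together

overlap_independent = mean_pairwise_overlap(independent_spans)
overlap_joint = mean_pairwise_overlap(joint_spans)
violations_independent = count_order_violations(independent_spans)
violations_joint = count_order_violations(joint_spans)
```

In this toy case the independently predicted spans overlap and place the third step before the second, while the jointly predicted spans do neither.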
Abstract
We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as the general mutual exclusiveness and the temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, thus harming the accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing up the visual content features of the step diagrams and a fixed number of learnable positional embeddings. Our insight is that self-attention among composite queries…
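The composite-query construction described in the abstract can be sketched as a Cartesian product of step-diagram features and positional embeddings. This is a rough illustration only: the function name build_composite_queries, the dictionary layout, and the toy feature values are assumptions, and the abstract does not say how the two parts are combined inside the model, so the sketch simply keeps them side by side.

```python
from itertools import product

def build_composite_queries(diagram_features, positional_embeddings):
    """Exhaustively pair each step-diagram feature with each positional embedding.

    Returns one query per (diagram, embedding) pair, so N diagrams and
    K embeddings yield N * K composite queries.
    """
    return [
        {"step": i, "slot": j, "content": content, "position": position}
        for (i, content), (j, position) in product(
            enumerate(diagram_features), enumerate(positional_embeddings)
        )
    ]

# Toy inputs (assumed values): 3 step diagrams and 2 positional embeddings.
# In a real model the embeddings would be learnable parameters.
diagram_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
positional_embeddings = [[1.0, 0.0], [0.0, 1.0]]

queries = build_composite_queries(diagram_features, positional_embeddings)
```

Each step diagram appears in several queries, one per positional embedding, which is what allows attention among the queries to compare candidate timespans both within a step and across steps.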
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus
