Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos
Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari

TL;DR
This paper introduces a differentiable, neural network-compatible method for learning task graphs from egocentric videos, improving procedural activity understanding and online mistake detection accuracy.
Contribution
It presents a novel gradient-based approach for learning task graphs directly from video data, enabling better procedural activity modeling and mistake detection.
Findings
Improved task graph prediction on the CaptainCook4D dataset by +16.7% over previous approaches.
Enhanced online mistake detection with +19.8% and +7.5% improvements on two datasets.
Demonstrated emerging video understanding abilities from textual and video embeddings.
Abstract
Procedural activities are sequences of key-steps aimed at achieving specific goals. They are crucial to build intelligent agents able to assist users effectively. In this context, task graphs have emerged as a human-understandable representation of procedural activities, encoding a partial ordering over the key-steps. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, in this paper, we propose an approach based on direct maximum likelihood optimization of edges' weights, which allows gradient-based learning of task graphs and can be naturally plugged into neural network architectures. Experiments on the CaptainCook4D dataset demonstrate the ability of our approach to predict accurate task graphs from the observation of action sequences, with an improvement of +16.7% over previous approaches. Owing to the differentiability of the proposed…
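To make the idea of "direct maximum likelihood optimization of edges' weights" concrete, the following is a minimal sketch and not the authors' loss. It assumes a simplified, Plackett-Luce-style likelihood: each key-step `i` gets a feasibility score equal to the probability that none of the still-unobserved key-steps is one of its preconditions, and the next key-step is drawn in proportion to that score. The names `log_likelihood_and_grad`, `fit_task_graph`, the convention `z[i, j]` = "j is a precondition of i", and the toy sequences are all hypothetical. It also assumes each key-step occurs at most once per sequence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood_and_grad(theta, sequences):
    """Log-likelihood of key-step sequences and its gradient w.r.t. theta.

    Assumed convention: z[i, j] = sigmoid(theta[i, j]) is the soft weight of
    the edge "j is a precondition of i". The diagonal is ignored.
    """
    n = theta.shape[0]
    off_diag = 1.0 - np.eye(n)
    z = sigmoid(theta)
    # log(1 - z) computed stably as -softplus(theta); self-edges masked out.
    log_not_edge = -np.logaddexp(0.0, theta) * off_diag
    total_ll = 0.0
    grad = np.zeros_like(theta)
    for seq in sequences:
        unobserved = np.ones(n)  # 1.0 for key-steps not yet seen
        for k in seq:
            # log-feasibility: no unobserved key-step is a precondition of i
            log_f = log_not_edge @ unobserved
            cand = unobserved > 0
            shifted = np.where(cand, log_f - log_f[cand].max(), -np.inf)
            w = np.exp(shifted)
            p = w / w.sum()  # softmax over unobserved candidates
            total_ll += np.log(p[k])
            one_hot = np.zeros(n)
            one_hot[k] = 1.0
            # chain rule: d log_f[i] / d theta[i, j] = -z[i, j] * unobserved[j]
            grad += -z * off_diag * np.outer(one_hot - p, unobserved)
            unobserved[k] = 0.0
    return total_ll, grad

def fit_task_graph(sequences, num_keysteps, lr=0.2, iters=1000):
    """Plain gradient ascent on the log-likelihood; returns the learned logits."""
    theta = np.zeros((num_keysteps, num_keysteps))
    for _ in range(iters):
        _, grad = log_likelihood_and_grad(theta, sequences)
        theta += lr * grad
    return theta

# Toy data: step 0 always first, steps 1 and 2 in either order, step 3 last.
sequences = [[0, 1, 2, 3], [0, 2, 1, 3]]
initial_ll, _ = log_likelihood_and_grad(np.zeros((4, 4)), sequences)
theta = fit_task_graph(sequences, num_keysteps=4)
final_ll, _ = log_likelihood_and_grad(theta, sequences)
z = sigmoid(theta)
print(np.round(z, 2))
```

Because the objective is a smooth function of `theta`, the same computation could in principle be written in an autodiff framework and trained jointly with a network that produces `theta`, which is the property the abstract highlights. In this simplified sketch, weights between interchangeable key-steps (here 1 and 2) are only weakly constrained by the data, and transitive edges such as 0 to 3 are also learned.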
Taxonomy
Topics: Advanced Graph Neural Networks · Online Learning and Analytics · Human Pose and Action Recognition
