Keystep Recognition using Graph Neural Networks
Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh

TL;DR
This paper introduces GLEVR, a graph neural network framework for keystep recognition in egocentric videos, leveraging long-term dependencies and multimodal data to outperform existing methods.
Contribution
The paper proposes a novel graph-learning framework, GLEVR, for fine-grained keystep recognition that effectively utilizes long-term dependencies and multimodal data in egocentric videos.
Findings
GLEVR outperforms existing models on the Ego-Exo4D dataset.
The constructed graphs are sparse, which keeps the approach computationally efficient.
Aligning egocentric and exocentric videos during training improves inference on egocentric videos.
Abstract
We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, as well as adding automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies to define connections across these nodes. We perform extensive experiments on the…
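The node-classification framing in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the temporal `window` rule for connecting clips, the one-to-one alignment of exocentric clips with egocentric clips, and the single mean-aggregation layer are all assumptions chosen for illustration. The abstract says only that several connection strategies were examined.

```python
import numpy as np

def build_clip_graph(num_ego, num_exo=0, window=2):
    """Return a directed edge list of (src, dst) pairs over clip nodes.

    Nodes 0..num_ego-1 are egocentric clips in temporal order.
    Nodes num_ego..num_ego+num_exo-1 are exocentric clips, ASSUMED here
    to be time-aligned one-to-one with the first num_exo ego clips.
    """
    if num_exo > num_ego:
        raise ValueError("this sketch assumes num_exo <= num_ego")
    edges = []
    # Sparse temporal edges: each ego clip only sees clips within `window`
    # steps, so the edge count grows linearly with the number of clips.
    for i in range(num_ego):
        for j in range(max(0, i - window), min(num_ego, i + window + 1)):
            edges.append((j, i))  # includes the self-loop j == i
    # Optional training-time nodes: link each exo clip to its aligned ego clip.
    for k in range(num_exo):
        ego, exo = k, num_ego + k
        edges.append((exo, exo))  # self-loop
        edges.append((exo, ego))
        edges.append((ego, exo))
    return np.array(edges, dtype=np.int64)

def mean_aggregate(features, edges):
    """One round of message passing: each node averages its in-neighbours."""
    out = np.zeros_like(features, dtype=float)
    deg = np.zeros(features.shape[0])
    np.add.at(out, edges[:, 1], features[edges[:, 0]])
    np.add.at(deg, edges[:, 1], 1.0)
    return out / deg[:, None]  # every node has a self-loop, so deg > 0

def classify_nodes(features, edges, weight):
    """Predict one keystep label per node from aggregated clip features."""
    logits = mean_aggregate(features, edges) @ weight
    return logits.argmax(axis=1)

# Hypothetical usage: 5 ego clips, 2 aligned exo clips, 4-dim features, 3 keysteps.
rng = np.random.default_rng(0)
edges = build_clip_graph(num_ego=5, num_exo=2, window=1)
features = rng.normal(size=(7, 4))
weight = rng.normal(size=(4, 3))
labels = classify_nodes(features, edges, weight)
```

Under these assumptions, inference on an egocentric-only video corresponds to calling `build_clip_graph` with `num_exo=0`, so the extra nodes influence only what is learned during training.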
Taxonomy
Topics: Hand Gesture Recognition Systems · Handwritten Text Recognition Techniques
