VideoGraph: Recognizing Minutes-Long Human Activities in Videos
Noureldien Hussein, Efstratios Gavves, Arnold W.M. Smeulders

TL;DR
VideoGraph is a novel graph-based approach that models minutes-long human activities in videos by learning temporal structure directly from data, outperforming existing methods on benchmark datasets.
Contribution
It introduces a fully data-driven graph representation for long-duration activities, capturing temporal dependencies without requiring node-level annotations.
Findings
Outperforms related methods on Epic-Kitchen and Breakfast datasets
Successfully models minutes-long temporal dependencies
Learns activity structure directly from video data
Abstract
Many human activities take minutes to unfold. To represent them, some related works opt for statistical pooling, which neglects the temporal structure. Others opt for convolutional methods, such as CNN and Non-Local. While successful in learning temporal concepts, these methods fall short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method to achieve the best of both worlds: represent minutes-long human activities and learn their underlying temporal structure. VideoGraph learns a graph-based representation for human activities. The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related works on two benchmarks: Epic-Kitchen and Breakfast. In addition, we demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
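The abstract's point that statistical pooling neglects temporal structure can be illustrated with a small sketch. This is not the VideoGraph method. The feature vectors, the step names, and the helper functions `mean_pool` and `max_pool` are hypothetical, chosen only to show that pooling over time gives the same result whatever order the video segments arrive in.

```python
# Illustrative sketch only: toy per-segment features, not taken from the paper.

def mean_pool(segments):
    """Average a list of equal-length feature vectors over time."""
    n = len(segments)
    return [sum(column) / n for column in zip(*segments)]

def max_pool(segments):
    """Take the element-wise maximum of the feature vectors over time."""
    return [max(column) for column in zip(*segments)]

# Hypothetical 2-D features for three steps of an activity.
take_cup = [1.0, 0.0]
pour = [0.0, 1.0]
stir = [1.0, 1.0]

forward = [take_cup, pour, stir]    # the steps in their natural order
shuffled = [stir, take_cup, pour]   # the same steps in a different order

# The two sequences differ, but both pooled descriptors are identical,
# so the ordering of the steps is lost after pooling.
sequences_differ = forward != shuffled
same_mean = mean_pool(forward) == mean_pool(shuffled)
same_max = max_pool(forward) == max_pool(shuffled)

print(sequences_differ, same_mean, same_max)
```

Because the pooled descriptor is the same for both orderings, a classifier built on it cannot tell them apart. A representation that models temporal structure would be expected to separate the two.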
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods
