GraphVid: It Only Takes a Few Nodes to Understand a Video
Eitan Kosman, Dotan Di Castro

TL;DR
GraphVid introduces a graph-based video representation using superpixels and GCNs, significantly reducing computational costs while maintaining competitive performance, enabling efficient video understanding on limited hardware.
Contribution
The paper presents a novel superpixel-based graph representation for videos processed with GCNs, achieving high efficiency and comparable accuracy with fewer resources.
Findings
Reduces computational requirements 10-fold
Maintains competitive accuracy on Kinetics-400 and Charades
Enables resource-limited hardware to perform video analysis
Abstract
We propose a concise representation of videos that encodes perceptually meaningful features into graphs. With this representation, we aim to exploit the large amount of redundancy in videos and save computation. First, we construct superpixel-based graph representations of videos by treating superpixels as graph nodes and creating spatial and temporal connections between adjacent superpixels. Then, we use Graph Convolutional Networks to process this representation and predict the desired output. As a result, we are able to train models with far fewer parameters, which translates into shorter training periods and lower computational resource requirements. A comprehensive experimental study on the publicly available datasets Kinetics-400 and Charades shows that the proposed method is highly cost-effective and uses limited commodity hardware during training and inference. It…
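The pipeline described in the abstract (superpixels as nodes, spatial and temporal edges, then graph convolution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It makes several simplifying assumptions: a regular grid of blocks stands in for a real superpixel algorithm such as SLIC, node features are just mean block colors, temporal edges link only the same grid cell in consecutive frames, and the single graph-convolution layer uses the common symmetric normalization with random weights. The names `video_to_graph` and `gcn_layer` are hypothetical.

```python
import numpy as np


def video_to_graph(video, grid=4):
    """Turn a (T, H, W, C) video into node features X and adjacency A.

    Assumption: grid blocks stand in for superpixels. Each block in each
    frame becomes one node whose feature is the block's mean color.
    """
    T, H, W, C = video.shape
    bh, bw = H // grid, W // grid
    # Crop so the frame divides evenly, then average each block.
    blocks = video[:, :bh * grid, :bw * grid].reshape(T, grid, bh, grid, bw, C)
    X = blocks.mean(axis=(2, 4)).reshape(T * grid * grid, C)

    def node(t, r, c):
        # Flat node index for frame t, block row r, block column c.
        return (t * grid + r) * grid + c

    n = T * grid * grid
    A = np.zeros((n, n))
    for t in range(T):
        for r in range(grid):
            for c in range(grid):
                i = node(t, r, c)
                neighbors = []
                if c + 1 < grid:  # spatial edge to the right-hand block
                    neighbors.append(node(t, r, c + 1))
                if r + 1 < grid:  # spatial edge to the block below
                    neighbors.append(node(t, r + 1, c))
                if t + 1 < T:     # temporal edge to the same block, next frame
                    neighbors.append(node(t + 1, r, c))
                for j in neighbors:
                    A[i, j] = A[j, i] = 1.0
    return X, A


def gcn_layer(X, A, W):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((8, 64, 64, 3))      # 8 synthetic RGB frames
    X, A = video_to_graph(video, grid=4)    # 8 * 16 = 128 nodes
    W = rng.standard_normal((3, 16))        # untrained, illustrative weights
    H = gcn_layer(X, A, W)
    video_embedding = H.mean(axis=0)        # mean-pool nodes into one vector
    print(X.shape, A.shape, video_embedding.shape)
```

Under these assumptions, the 8 frames of 64 x 64 pixels (98,304 color values) are reduced to 128 nodes with 3 features each, which is the kind of redundancy reduction the abstract refers to. A real system would replace the grid with an actual superpixel segmentation and train the layer weights.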
Taxonomy
Topics: Advanced Graph Neural Networks · Visual Attention and Saliency Detection · Image and Video Quality Assessment
