Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction
Izzeddin Teeti, Rongali Sai Bhargav, Vivek Singh, Andrew Bradley, Biplab Banerjee, Fabio Cuzzolin

TL;DR
Temporal DINO introduces a self-supervised video strategy that leverages teacher-student models to improve action prediction by capturing long-term dependencies, reducing data requirements, and streamlining training.
Contribution
The paper presents a novel self-supervised approach using teacher-student models for action prediction, enhancing performance across multiple architectures with less data and training complexity.
Findings
Achieves an average improvement of 9.9% in prediction accuracy.
Works across several backbone architectures, including 3D-ResNet, Transformer, and LSTM.
Reduces reliance on extensive labeled data and complex augmentations.
Abstract
The emerging field of action prediction plays a vital role in various computer vision applications such as autonomous driving, activity analysis and human-computer interaction. Despite significant advancements, accurately predicting future actions remains a challenging problem due to high dimensionality, complex dynamics and uncertainties inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' processing past frames, and a 'teacher' processing both past and future frames, enabling a broader temporal context. During training, the teacher guides the student to learn future context by only observing…
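The teacher-student setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated and does not state the loss or the teacher update rule, so the sketch assumes the cross-entropy objective, temperatures, centering, and exponential-moving-average (EMA) teacher update from the original DINO method. All function names (`split_clip`, `distillation_loss`, `ema_update`) are hypothetical, and plain Python lists stand in for model outputs and weights.

```python
import math


def split_clip(frames, n_past):
    """Return (student_input, teacher_input) for one clip.

    The student sees only the first n_past frames; the teacher sees the
    whole clip (past and future), as the abstract describes.
    """
    return frames[:n_past], frames


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student output distributions.

    ASSUMPTION: this follows the original DINO loss (centered, sharpened
    teacher; softer student). Temporal DINO may use a different loss.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, t_teacher)]
    student_log_probs = log_softmax(student_logits, t_student)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """ASSUMPTION: the teacher tracks the student via EMA, as in DINO."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


if __name__ == "__main__":
    past, full = split_clip(list(range(8)), n_past=5)
    print("student frames:", past)
    print("teacher frames:", full)

    center = [0.0, 0.0, 0.0]
    teacher_out = [2.0, 0.0, 0.0]
    print("loss when student agrees:   ",
          distillation_loss([2.0, 0.0, 0.0], teacher_out, center))
    print("loss when student disagrees:",
          distillation_loss([0.0, 2.0, 0.0], teacher_out, center))
```

In a real training loop, gradients of this loss would flow only through the student, and the backbone (for example a 3D-ResNet, Transformer, or LSTM) would produce the logits from the frame inputs. Under these assumptions, the student is pushed to match outputs that the teacher computed with access to future frames.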
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Vision Transformer · Linear Layer · Softmax · Layer Normalization · Label Smoothing · Adam · Residual Connection · Dense Connections
