Multiscale Video Pretraining for Long-Term Activity Forecasting
Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani

TL;DR
This paper introduces Multiscale Video Pretraining (MVP), a self-supervised approach that learns robust video representations across multiple timescales, significantly improving long-term activity forecasting accuracy.
Contribution
MVP is a novel self-supervised pretraining method that captures multiscale temporal features for better long-term activity prediction in videos.
Findings
MVP outperforms existing self-supervised methods on long-term forecasting tasks.
Achieves a relative accuracy gain of over 20% in video summary forecasting compared with existing methods.
Demonstrates effectiveness across multiple datasets including Ego4D and Epic-Kitchens.
Abstract
Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by learning to predict contextualized representations of future video clips over multiple timescales. MVP is based on our observation that actions in videos have a multiscale nature, where atomic actions typically occur at a short timescale and more complex actions may span longer timescales. We compare MVP to state-of-the-art self-supervised video…
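The multiscale idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows one way to build prediction targets that summarize future clips over windows of increasing length. The function name `multiscale_future_targets`, the window lengths in `scales`, and the use of max-pooling as the aggregator are all assumptions made for illustration.

```python
import numpy as np


def multiscale_future_targets(clip_feats, t, scales=(1, 2, 4)):
    """Pool future clip features over windows of increasing length.

    clip_feats: array of shape (num_clips, dim), one feature vector per clip.
    t: index of the last observed clip.
    scales: hypothetical window lengths, measured in clips.
    Returns an array of shape (num_usable_scales, dim).
    """
    targets = []
    for k in scales:
        # The k clips that immediately follow the observed clip t.
        window = clip_feats[t + 1 : t + 1 + k]
        if len(window) < k:
            break  # not enough future clips left for this scale
        # Max-pooling is an assumed aggregator; mean-pooling would also work.
        targets.append(window.max(axis=0))
    return np.stack(targets)


# Toy features: 8 clips, 3 dimensions each.
feats = np.arange(24, dtype=np.float32).reshape(8, 3)
targets = multiscale_future_targets(feats, t=2)
print(targets.shape)
```

Under these assumptions, short windows correspond to atomic actions and longer windows to more complex activities; a forecasting model would be trained to predict each row of `targets` from the clips observed up to index `t`.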
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Context-Aware Activity Recognition Systems
