Multiscale Video Pretraining for Long-Term Activity Forecasting

Reuben Tan; Matthias De Lange; Michael Iuzzolino; Bryan A. Plummer; Kate Saenko; Karl Ridgeway; Lorenzo Torresani

arXiv:2307.12854·cs.CV·July 25, 2023

TL;DR

This paper introduces Multiscale Video Pretraining (MVP), a self-supervised approach that learns robust video representations across multiple timescales, significantly improving long-term activity forecasting accuracy.

Contribution

MVP is a novel self-supervised pretraining method that captures multiscale temporal features for better long-term activity prediction in videos.

Findings

01  MVP outperforms existing self-supervised methods on long-term forecasting tasks.

02  MVP achieves a relative accuracy improvement of over 20% in video summary prediction.

03  MVP is effective across multiple datasets, including Ego4D and Epic-Kitchens.

Abstract

Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by learning to predict contextualized representations of future video clips over multiple timescales. MVP is based on our observation that actions in videos have a multiscale nature, where atomic actions typically occur at a short timescale and more complex actions may span longer timescales. We compare MVP to state-of-the-art self-supervised video…
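The idea of predicting future clip representations "over multiple timescales" can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify how future clips are aggregated, so mean pooling, the function names `mean_pool` and `multiscale_future_targets`, and the scales `(1, 2, 4)` are all assumptions chosen for illustration. Given per-clip feature vectors and the number of observed clips, it builds one target per timescale by aggregating the next 1, 2, or 4 future clips.

```python
from typing import Dict, List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def multiscale_future_targets(
    clip_features: Sequence[Sequence[float]],
    num_observed: int,
    scales: Sequence[int] = (1, 2, 4),  # hypothetical timescales, in clips
) -> Dict[int, List[float]]:
    """Build one aggregated future target per timescale.

    Assumption: aggregation is mean pooling over the next `scale` clips.
    The source abstract does not state the aggregation function.
    """
    future = clip_features[num_observed:]
    targets: Dict[int, List[float]] = {}
    for scale in scales:
        # Skip scales that would reach past the end of the video.
        if scale <= len(future):
            targets[scale] = mean_pool(future[:scale])
    return targets


# Toy example: six clips with 2-D features, the first two are observed.
clips = [[float(i), float(i)] for i in range(6)]
targets = multiscale_future_targets(clips, num_observed=2)
```

In a pretraining setup along these lines, a model would see only the observed clips and be trained to predict each target, so that short scales correspond to atomic actions and longer scales to more complex activities.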

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Context-Aware Activity Recognition Systems