TL;DR
This paper introduces a framework that uses the temporal structure of unlabeled video to train deep networks to predict the visual representation of future frames. Applying recognition algorithms to the predicted representation lets the system anticipate actions and objects before they occur.
Contribution
The paper proposes predicting high-level visual representations of future frames, rather than pixels, from unlabeled video, and using those predictions to improve action and object anticipation.
Findings
Successfully anticipates actions one second ahead.
Predicts objects five seconds into the future.
Validates the approach on two datasets.
Abstract
Anticipating actions and objects before they start or appear is a difficult problem in computer vision with several real-world applications. This task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently learning this knowledge is through readily available unlabeled video. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects…
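The key idea in the abstract (regress the future representation from the current one, then run a recognizer on the prediction) can be illustrated with a toy sketch. This is not the paper's model: the abstract describes deep networks, whereas the sketch below swaps in a linear regressor trained with plain SGD, uses synthetic feature vectors in place of features computed from video frames, and uses an argmax "recognizer". All function names and the toy dynamics are assumptions made for illustration.

```python
import random


def make_pairs(n, dim, rng):
    """Synthetic (current feature, future feature) pairs.

    Assumption: stands in for features extracted from frames at time t and
    t + delta of unlabeled video; the toy dynamics below are invented.
    """
    pairs = []
    for _ in range(n):
        now = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Hypothetical dynamics: the future feature is a scaled cyclic shift.
        future = [0.9 * now[(i + 1) % dim] for i in range(dim)]
        pairs.append((now, future))
    return pairs


def predict(weights, feature):
    """Linear map from the current feature to the predicted future feature."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def train_regressor(pairs, dim, lr=0.05, epochs=50):
    """Fit the linear predictor with plain SGD on squared error.

    No labels are used: the regression target is the future feature itself.
    """
    weights = [[0.0] * dim for _ in range(dim)]
    for _ in range(epochs):
        for now, future in pairs:
            guess = predict(weights, now)
            for i in range(dim):
                err = guess[i] - future[i]
                for j in range(dim):
                    weights[i][j] -= lr * err * now[j]
    return weights


def mse(weights, pairs):
    """Mean squared error between predicted and true future features."""
    total, count = 0.0, 0
    for now, future in pairs:
        guess = predict(weights, now)
        total += sum((g - f) ** 2 for g, f in zip(guess, future))
        count += len(future)
    return total / count


def classify(feature):
    """Hypothetical recognizer: index of the largest feature component."""
    return max(range(len(feature)), key=lambda i: feature[i])


def anticipation_accuracy(weights, pairs):
    """Fraction of pairs where the label of the predicted feature matches
    the label of the true future feature."""
    hits = sum(
        classify(predict(weights, now)) == classify(future)
        for now, future in pairs
    )
    return hits / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_pairs(200, 4, rng)
    held_out = make_pairs(50, 4, rng)
    weights = train_regressor(train, 4)
    print("held-out MSE:", mse(weights, held_out))
    print("anticipation accuracy:", anticipation_accuracy(weights, held_out))
```

The point of the sketch is the division of labor: the predictor is trained only on temporally paired features, so it needs no annotation, and the recognizer is applied unchanged to the predicted feature as if it were a real one.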
Videos
Anticipating Visual Representations From Unlabeled Video (YouTube)
