Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting
Yan Bin Ng, Basura Fernando

TL;DR
This paper introduces a neural machine translation approach with attention for action sequence forecasting from videos, achieving state-of-the-art results in weakly supervised settings without frame-level annotations.
Contribution
It proposes a novel encoder-decoder model with a new loss function for action sequence forecasting, extending to weakly supervised learning on challenging datasets.
Findings
The model outperforms state-of-the-art supervised models on the Breakfast and 50Salads datasets.
The weakly supervised model achieves results close to those of fully supervised methods.
The attention mechanism and the novel loss functions improve forecasting accuracy.
Abstract
Forecasting future human actions from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance and security. We present a method to forecast actions for the unseen future of a video using a neural machine translation technique based on an encoder-decoder architecture. The input to this model is the observed RGB video, and the objective is to forecast the correct future symbolic action sequence. Unlike prior methods that predict one action per frame for some unseen percentage of the video, we predict the complete action sequence that is required to accomplish the activity. We call this task action sequence forecasting. To cater for two types of uncertainty in the future predictions, we propose a novel loss function. We show that a combination of optimal transport and future uncertainty losses helps to…
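The abstract describes an encoder-decoder model with attention but does not give its exact form here. As a rough illustration only, the sketch below shows generic dot-product attention, in which a decoder state weights the encoded features of the observed video before predicting the next action symbol. The function names, the toy feature vectors, and the choice of dot-product scoring are all assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Generic dot-product attention (an assumed form, not the paper's).

    Scores each encoder state by its dot product with the decoder state,
    normalizes the scores with softmax, and returns the weights together
    with the weighted sum of encoder states (the context vector).
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical toy features for three observed video segments.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder state while predicting the next action symbol.
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
```

In this toy example, the first and third segments receive equal weight because they have the same dot product with the decoder state, and the second segment receives less.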
