TL;DR
This paper explores whether semantic extraction should precede or follow video prediction to improve scene understanding, using LFDTN and U-Net models on synthetic and real datasets.
Contribution
It systematically compares the two orderings of semantic extraction and video prediction (predict then extract, or extract then predict) and demonstrates their impact on scene understanding tasks.
Findings
Semantic extraction before prediction can enhance scene interpretation.
The order of prediction and semantics extraction affects downstream task performance.
Empirical evaluation on synthetic and real datasets supports these findings.
Abstract
The ultimate goal of video prediction is not to forecast future pixel values from previous frames. Rather, it is to discover valuable internal representations from the vast amount of available unlabeled video data, in a self-supervised fashion, for use in downstream tasks. One of the primary downstream tasks is interpreting the scene's semantic composition and using it for decision-making. For example, by predicting human movements, an observer can anticipate human activities and collaborate in a shared workspace. Given a pre-trained video prediction model and a pre-trained semantic extraction model, there are two main ways to achieve the same outcome: one can first predict and then extract semantics, or first extract semantics and then predict. We investigate these configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video…
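The two configurations described in the abstract amount to composing a predictor and a semantic extractor in opposite orders. The sketch below illustrates only that ordering and is not the paper's method: `predict_next` (per-pixel linear extrapolation) and `extract_semantics` (thresholding) are hypothetical toy stand-ins for the pre-trained video prediction and semantic extraction models, and the frame values are made up for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # a tiny "image" flattened to a list of pixel intensities


def predict_next(frames: Sequence[Frame]) -> Frame:
    """Toy predictor (assumption): linear extrapolation from the last two frames."""
    prev, last = frames[-2], frames[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def extract_semantics(frame: Frame, threshold: float = 0.5) -> Frame:
    """Toy extractor (assumption): binary mask, 1.0 where intensity exceeds threshold."""
    return [1.0 if v > threshold else 0.0 for v in frame]


def predict_then_extract(frames: Sequence[Frame]) -> Frame:
    """Configuration 1: predict the next frame in pixel space, then extract semantics."""
    return extract_semantics(predict_next(frames))


def extract_then_predict(frames: Sequence[Frame]) -> Frame:
    """Configuration 2: extract semantics per frame, then predict in semantic space."""
    return predict_next([extract_semantics(f) for f in frames])


# Two frames of two pixels; the first pixel is brightening over time.
frames = [[0.2, 0.4], [0.4, 0.4]]
pixel_first = predict_then_extract(frames)
semantic_first = extract_then_predict(frames)
print("predict then extract:", pixel_first)
print("extract then predict:", semantic_first)
```

With these toy components the two orderings disagree on the first pixel: the brightening trend survives in pixel space but is discarded once each frame is thresholded. This shows only that the order of the two steps can change the output; it does not indicate which ordering performs better with real models.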
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Convolution · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing
