S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction

Mohammad Adiban; Kalin Stefanov; Sabato Marco Siniscalchi; Giampiero Salvi

arXiv:2307.06701 · cs.CV · November 20, 2024

TL;DR

This paper introduces S-HR-VQVAE, a novel model combining hierarchical residual vector quantized autoencoders with autoregressive spatiotemporal prediction, significantly improving video prediction accuracy and efficiency.

Contribution

The paper presents a new hierarchical residual VQVAE and an autoregressive model for video prediction, achieving better results with smaller models.

Findings

01. Outperforms the state of the art on multiple datasets
02. Reduces model size while maintaining accuracy
03. Effectively models spatiotemporal information

Abstract

We address the video prediction task by putting forth a novel model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic ability of HR-VQVAE to model still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction. These include learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ,…
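The abstract does not spell out how HR-VQVAE quantizes a latent vector. The NumPy sketch below illustrates two-layer hierarchical residual quantization under stated assumptions: layer 2 quantizes the residual left by layer 1, and each layer-1 codeword owns its own layer-2 codebook. That codebook layout is an assumption made for illustration, not a detail given on this page. The function name `hr_quantize` and the array shapes are hypothetical; the encoder, decoder, and codebook training losses are omitted.

```python
import numpy as np


def hr_quantize(z, top, children):
    """Two-layer hierarchical residual quantization (illustrative sketch).

    z:        (N, D) latent vectors, e.g. a flattened grid of encoder outputs.
    top:      (K1, D) layer-1 codebook.
    children: (K1, K2, D) hypothetical layout where children[k] is the
              layer-2 codebook attached to layer-1 codeword k (assumption).
    Returns layer-1 indices, layer-2 indices, and the reconstruction e1 + e2.
    """
    # Layer 1: nearest codeword in the shared top-level codebook.
    d1 = ((z[:, None, :] - top[None, :, :]) ** 2).sum(axis=-1)  # (N, K1)
    k1 = d1.argmin(axis=1)                                      # (N,)
    e1 = top[k1]                                                # (N, D)

    # Layer 2: quantize what layer 1 failed to capture.
    residual = z - e1
    sub = children[k1]                                          # (N, K2, D)
    d2 = ((residual[:, None, :] - sub) ** 2).sum(axis=-1)       # (N, K2)
    k2 = d2.argmin(axis=1)                                      # (N,)
    e2 = sub[np.arange(len(z)), k2]                             # (N, D)

    return k1, k2, e1 + e2
```

In this sketch the layer-2 search is restricted to the K2 children of the chosen layer-1 codeword, so each lookup stays small while the two layers together can express K1 × K2 distinct reconstructions. If every child codebook contains the zero vector, adding layer 2 can never increase the reconstruction error.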

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques

Methods: PixelCNN
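The taxonomy lists PixelCNN, an autoregressive model built from masked convolutions. This page does not say how AST-PM uses it, so the sketch below shows only the generic PixelCNN raster-scan causal mask applied to a single-channel 2D grid (for example, a hypothetical grid of latent values). The names `pixelcnn_mask` and `masked_conv2d` are illustrative, not the paper's code. Gated activations, multiple channels, and the blind-spot fix of later PixelCNN variants are omitted.

```python
import numpy as np


def pixelcnn_mask(kernel_size, mask_type):
    """Binary mask for a k x k kernel under raster-scan (row-major) ordering.

    Type 'A' also hides the centre, so an output never sees its own input;
    type 'B' keeps the centre and is used for layers after the first.
    """
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    mask = np.ones((kernel_size, kernel_size), dtype=np.float32)
    c = kernel_size // 2
    # Centre row: hide positions right of the centre (and the centre for 'A').
    mask[c, c + (mask_type == "B"):] = 0.0
    # Hide every row below the centre.
    mask[c + 1:, :] = 0.0
    return mask


def masked_conv2d(x, kernel, mask_type):
    """Single-channel 'same' correlation with a PixelCNN-masked kernel."""
    k = kernel.shape[0]
    w = kernel * pixelcnn_mask(k, mask_type)
    p = k // 2
    xp = np.pad(x, p)  # zero padding keeps the output the same size as x
    h, width = x.shape
    out = np.zeros((h, width), dtype=float)
    for i in range(h):
        for j in range(width):
            # Window centred on (i, j); masked taps only reach earlier positions.
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out
```

A type 'A' layer goes first so that no output position depends on its own input; stacking type 'B' layers after it preserves the raster-scan ordering while letting information flow from earlier positions.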