Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention
Shreyam Gupta (1), P. Agrawal (2), Priyam Gupta (3) ((1) Indian Institute of Technology (BHU), Varanasi, India, (2) University of Colorado, Boulder, USA, (3) Intelligent Field Robotic Systems (IFRoS), University of Girona, Spain)

TL;DR
The paper introduces MAUCell, a novel multi-attention framework combining GANs and hierarchical processing to improve long-term, high-fidelity video prediction, achieving state-of-the-art results on multiple datasets.
Contribution
It presents MAUCell, an innovative architecture that addresses long-term coherence and detail in video prediction using multi-modal attention and hierarchical GAN strategies.
Findings
Achieves state-of-the-art performance on Moving MNIST, KTH Action, and CASIA-B datasets.
Outperforms existing models on the LPIPS and SSIM metrics.
Demonstrates efficient real-time inference with high-quality video generation.
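For readers unfamiliar with the SSIM metric named in the findings, the following is a minimal sketch of the standard SSIM formula applied once over a whole frame. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `global_ssim` is hypothetical, frames are assumed to be flattened into lists of intensities in [0, 1], and reported SSIM scores are normally averaged over local sliding windows rather than computed in a single global window as done here. LPIPS is not sketched because it depends on a pretrained network.

```python
from statistics import fmean


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two flattened frames (hypothetical helper).

    x, y: equal-length sequences of pixel intensities.
    data_range: assumed span of valid intensities (1.0 for values in [0, 1]).
    k1, k2: the stabilising constants commonly used with SSIM.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same length")

    # Small constants that keep the ratio stable when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    # Mean intensity of each frame.
    mu_x = fmean(x)
    mu_y = fmean(y)

    # Population variance of each frame and the covariance between them.
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov_xy = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A predicted frame identical to the ground truth scores 1, and the score falls as luminance, contrast, or structure diverge, so higher SSIM indicates a closer prediction.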
Abstract
Rapid progress in computer vision has created a need for more advanced methods of temporal sequence modeling, which is essential for autonomous systems, real-time surveillance, and anomaly prediction. As demand for accurate video prediction grows, the limitations of traditional deterministic models have become clear, particularly their difficulty maintaining long-term temporal coherence while preserving high-frequency spatial detail. This report provides a detailed analysis of the Multi-Attention Unit Cell (MAUCell), a novel architectural framework for video frame prediction. By combining Generative Adversarial Networks (GANs) with a hierarchical "STAR-GAN" processing strategy and a triad of specialized attention mechanisms (Temporal, Spatial, and Pixel-wise), the MAUCell addresses the persistent…
