Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention

Shreyam Gupta (1), P. Agrawal (2), Priyam Gupta (3) ((1) Indian Institute of Technology (BHU), Varanasi, India, (2) University of Colorado, Boulder, USA, (3) Intelligent Field Robotic Systems (IFRoS), University of Girona, Spain)

arXiv:2501.16997 · cs.CV · March 31, 2026

TL;DR

The paper introduces MAUCell, a novel multi-attention framework combining GANs and hierarchical processing to improve long-term, high-fidelity video prediction, achieving state-of-the-art results on multiple datasets.

Contribution

It presents MAUCell, an innovative architecture that addresses long-term coherence and detail in video prediction using multi-modal attention and hierarchical GAN strategies.

Findings

01. Achieves state-of-the-art performance on Moving MNIST, KTH Action, and CASIA-B datasets.

02. Outperforms existing models in LPIPS and SSIM metrics.

03. Demonstrates efficient real-time inference with high-quality video generation.

Abstract

Rapid progress in computer vision has created a need for more advanced methods of temporal sequence modeling, which is essential for autonomous systems, real-time surveillance, and anomaly prediction. As the demand for accurate video prediction grows, the limitations of traditional deterministic models have become clear, particularly their difficulty in maintaining long-term temporal coherence while preserving high-frequency spatial detail. This report provides an exhaustive analysis of the Multi-Attention Unit Cell (MAUCell), a novel architectural framework for video frame prediction. By combining Generative Adversarial Networks (GANs) with a hierarchical "STAR-GAN" processing strategy and a triad of specialized attention mechanisms (Temporal, Spatial, and Pixel-wise), the MAUCell addresses the persistent…
