RELIC: Interactive Video World Model with Long-Horizon Memory

Yicong Hong; Yiqun Mei; Chongjian Ge; Yiran Xu; Yang Zhou; Sai Bi; Yannick Hold-Geoffroy; Mike Roberts; Matthew Fisher; Eli Shechtman; Kalyan Sunkavalli; Feng Liu; Zhengqi Li; Hao Tan

arXiv:2512.04040·cs.CV·December 4, 2025

TL;DR

RELIC is a unified, real-time interactive video world model that combines long-horizon memory, spatial consistency, and user control, enabling detailed scene exploration and content retrieval over extended durations.

Contribution

RELIC introduces a framework that integrates long-horizon memory, real-time generation, and user control. It stores history as highly compressed latent tokens in the KV cache and trains with a memory-efficient self-forcing paradigm.

Findings

01. Achieves 16 FPS real-time generation
02. Demonstrates improved long-horizon coherence
03. Provides more accurate action following

Abstract

A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure…
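
The idea of a compact, camera-aware memory can be illustrated with a small sketch. It is not the paper's implementation. The abstract does not say how tokens are compressed or how actions and poses are encoded, so everything here is an assumption made for illustration: the names `MemoryEntry`, `pool_tokens`, and `CompressedMemory`, the use of average pooling as the compressor, and the `recent_window` and `factor` parameters. A real KV cache would also hold per-layer attention keys and values rather than raw token vectors.

```python
from dataclasses import dataclass
from typing import List, Tuple

Token = List[float]  # one latent token as a plain feature vector


@dataclass
class MemoryEntry:
    """One past frame in the (hypothetical) memory."""
    tokens: List[Token]        # latent tokens for this frame
    action: Tuple[float, ...]  # relative action that led to this frame
    pose: Tuple[float, ...]    # absolute camera pose of this frame
    compressed: bool = False


def pool_tokens(tokens: List[Token], factor: int) -> List[Token]:
    """Average each run of `factor` tokens into one token.

    Average pooling is an assumed stand-in for whatever compression
    the paper actually uses.
    """
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


class CompressedMemory:
    """Keeps recent frames at full resolution and compresses older ones."""

    def __init__(self, recent_window: int = 2, factor: int = 4):
        self.recent_window = recent_window  # frames kept uncompressed
        self.factor = factor                # token reduction for old frames
        self.entries: List[MemoryEntry] = []

    def append(self, tokens: List[Token], action, pose) -> None:
        self.entries.append(MemoryEntry(tokens, tuple(action), tuple(pose)))
        # Compress the frame that just fell out of the recent window.
        idx = len(self.entries) - self.recent_window - 1
        if idx >= 0 and not self.entries[idx].compressed:
            old = self.entries[idx]
            old.tokens = pool_tokens(old.tokens, self.factor)
            old.compressed = True

    def num_tokens(self) -> int:
        """Total tokens a generator would attend to across the whole history."""
        return sum(len(e.tokens) for e in self.entries)


if __name__ == "__main__":
    memory = CompressedMemory(recent_window=2, factor=4)
    for t in range(5):
        frame = [[float(t), float(t)] for _ in range(8)]  # 8 toy tokens per frame
        memory.append(frame, action=(0.0, 0.0, 1.0), pose=(0.0, 0.0, float(t)))
    # 3 compressed frames * 2 tokens + 2 recent frames * 8 tokens
    print(memory.num_tokens())  # 22
```

In this toy setup the history costs 22 tokens instead of 40, and every entry keeps its pose, so older content could in principle be looked up by camera position. The trade-off being illustrated is the one the abstract names: memory cost has to stay small for generation to remain real time.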

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging