EgoLCD: Egocentric Video Generation with Long Context Diffusion

Liuzhou Zhang; Jiarui Ye; Yuanlei Wang; Ming Zhong; Mingju Cao; Wanke Xia; Bowen Zeng; Zeyu Zhang; Hao Tang

arXiv:2512.04515·cs.CV·December 5, 2025

TL;DR

EgoLCD is a novel framework for generating long, coherent egocentric videos by managing long-term memory effectively, improving temporal consistency and reducing content drift in video synthesis.

Contribution

The paper introduces EgoLCD, which combines a Long-Term Sparse KV Cache for global context, an attention-based short-term memory extended by LoRA, a Memory Regulation Loss, and Structured Narrative Prompting to improve long-context egocentric video generation.

Findings

01. Achieves state-of-the-art results on the EgoVid-5M benchmark.

02. Mitigates content drift and generative forgetting.

03. Improves temporal coherence in long video synthesis.

Abstract

Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and…
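The abstract frames long video synthesis as memory management: a small, stable global store plus an attention-based short-term memory. The toy sketch below illustrates that general idea only and is not the authors' implementation. The class name `TwoTierMemory`, the per-entry importance `score`, the top-k eviction rule, and the window size are all assumptions made for illustration; the abstract does not say how EgoLCD selects entries for its sparse cache.

```python
import math
from collections import deque


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TwoTierMemory:
    """Hypothetical two-tier key/value memory (illustration only).

    - `recent`: sliding window holding the latest entries (short-term memory).
    - `long_term`: the highest-scoring entries evicted from the window
      (a stand-in for a sparse long-term cache).
    """

    def __init__(self, window=4, long_term_size=2):
        self.window = window
        self.long_term_size = long_term_size
        self.recent = deque()
        self.long_term = []  # list of (score, key, value)

    def append(self, key, value, score):
        """Add one entry; `score` is an assumed importance value."""
        self.recent.append((score, key, value))
        if len(self.recent) > self.window:
            # The oldest recent entry competes for a long-term slot.
            self.long_term.append(self.recent.popleft())
            self.long_term.sort(key=lambda e: e[0], reverse=True)
            del self.long_term[self.long_term_size:]

    def entries(self):
        """Long-term entries first, then the recent window in time order."""
        return self.long_term + list(self.recent)

    def attend(self, query):
        """Scaled dot-product attention of `query` over all stored entries."""
        entries = self.entries()
        if not entries:
            raise ValueError("memory is empty")
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, k) / scale for _, k, _ in entries])
        dim = len(entries[0][2])
        return [
            sum(w * v[i] for w, (_, _, v) in zip(weights, entries))
            for i in range(dim)
        ]
```

In this sketch the memory never grows beyond `window + long_term_size` entries, so the attention cost per step stays bounded however long the sequence runs; that bounded-cost property is the motivation for keeping the long-term store sparse.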

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition