Learning Plug-and-play Memory for Guiding Video Diffusion Models

Selena Song; Ziming Xu; Zijun Zhang; Kun Zhou; Jiaxian Guo; Lianhui Qin; Biwei Huang

arXiv:2511.19229·cs.CV·December 1, 2025

Open Access · 1 Model · 1 Dataset

TL;DR

This paper introduces DiT-Mem, a plug-and-play memory module for diffusion-based video generation models that injects world knowledge to improve physical consistency and visual quality.

Contribution

The work proposes a learnable memory encoder integrated into diffusion transformers, enabling targeted guidance and improved physical rule adherence in video generation.

Findings

01 Enhanced physical rule compliance in generated videos

02 Improved visual fidelity and temporal coherence

03 Efficient training with limited data and parameters

Abstract

Diffusion Transformer (DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a…
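To make the low-pass/high-pass idea concrete, here is a minimal sketch of splitting a single 1D sequence of embedding values into a low-frequency and a high-frequency part with a discrete Fourier transform. It is an illustration only: the excerpt does not say which axis DiT-Mem filters along, what filter design it uses, or which band corresponds to which kind of cue. The function name `split_bands` and its `cutoff` parameter are hypothetical, and the code uses only the Python standard library so that it stays self-contained.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part, assuming a real-valued signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_bands(signal, cutoff):
    """Split `signal` into (low, high) parts that sum back to `signal`.

    `cutoff` (a hypothetical parameter, not from the paper) is the largest
    frequency index kept in the low band; everything above it goes to the
    high band.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spec = []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        low_spec.append(coeff if freq <= cutoff else 0)
    # The high band is whatever the low band did not keep.
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    return idft(low_spec), idft(high_spec)


# Toy sequence: a constant offset plus a fast alternating component.
values = [1.0 + 0.5 * (-1) ** t for t in range(8)]
low, high = split_bands(values, cutoff=1)
```

In this toy case the low band recovers the constant offset and the high band recovers the alternating component. A real implementation would operate on multi-dimensional video embeddings with a tensor library rather than on Python lists.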

Code & Models

Models

Thrcle/DiT-Mem-1.3B · model · ♡ 2

Datasets

Thrcle/DiT-Mem-Data · dataset · 21 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis