Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing

Dohun Lee; Chun-Hao Paul Huang; Xuelin Chen; Jong Chul Ye; Duygu Ceylan; Hyeonho Jeong

arXiv:2601.16296 · cs.CV · March 24, 2026

TL;DR

Memory-V2V introduces a memory-augmented framework for multi-turn video editing that maintains consistency across iterative edits by leveraging external memory and relevance-aware retrieval, improving quality and coherence.

Contribution

The paper presents Memory-V2V, a novel memory-augmented approach that effectively preserves cross-turn consistency in iterative video editing workflows.

Findings

01. Significantly improves multi-turn video editing consistency.

02. Outperforms baseline methods in visual quality and coherence.

03. Maintains scalability with modest computational overhead.

Abstract

Video-to-video diffusion models achieve impressive single-turn editing performance, but practical editing workflows are inherently iterative. When edits are applied sequentially, existing models treat each turn independently, often causing previously generated regions to drift or be overwritten. We identify this failure mode as the problem of cross-turn consistency in multi-turn video editing. We introduce Memory-V2V, a memory-augmented framework that treats prior edits as structured constraints for subsequent generations. Memory-V2V maintains an external memory of previous outputs, retrieves task-relevant edits, and integrates them through relevance-aware tokenization and adaptive compression. These technical ingredients enable scalable conditioning without linear growth in computation. We demonstrate Memory-V2V on iterative video novel view synthesis and text-guided long video…
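The retrieve-then-compress idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `MemoryEntry` record, the cosine-similarity relevance score, and the proportional token budget are all assumptions made for illustration. It only shows how conditioning cost can stay bounded by a fixed budget as the number of stored edits grows.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MemoryEntry:
    """Hypothetical record of one earlier editing turn."""
    turn: int
    embedding: List[float]  # assumed feature vector summarizing the stored output
    num_tokens: int         # tokens the stored output would need uncompressed


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as a stand-in relevance score."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(memory: List[MemoryEntry], query: List[float], k: int) -> List[Tuple[float, MemoryEntry]]:
    """Return the k stored edits most relevant to the current turn."""
    scored = [(cosine(e.embedding, query), e) for e in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def allocate_tokens(scored: List[Tuple[float, MemoryEntry]], budget: int) -> Dict[int, int]:
    """Split a fixed token budget across retrieved edits in proportion to relevance.

    More relevant edits keep more tokens; the total never exceeds `budget`,
    no matter how many edits are stored in memory.
    """
    weights = [max(score, 0.0) for score, _ in scored]
    total = sum(weights)
    if total == 0:
        return {entry.turn: 0 for _, entry in scored}
    return {
        entry.turn: min(entry.num_tokens, int(budget * w / total))
        for w, (_, entry) in zip(weights, scored)
    }


# Toy usage: three earlier turns, one query describing the current edit.
memory = [
    MemoryEntry(turn=1, embedding=[1.0, 0.0], num_tokens=1000),
    MemoryEntry(turn=2, embedding=[0.0, 1.0], num_tokens=1000),
    MemoryEntry(turn=3, embedding=[1.0, 1.0], num_tokens=1000),
]
top = retrieve(memory, query=[1.0, 0.0], k=2)
alloc = allocate_tokens(top, budget=512)
```

In this toy setup the total number of conditioning tokens is capped by `budget`, so adding more turns to `memory` changes which edits are retrieved but not the size of the conditioning input.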

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications