Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang

TL;DR
Frame Guidance introduces a training-free method for fine-grained, controllable video generation using frame-level signals, leveraging a simple latent processing technique and latent optimization to achieve high-quality, coherent videos across diverse tasks.
Contribution
It proposes a novel training-free guidance approach for controllable video synthesis that works with any video model, using a simple latent processing and optimization strategy.
Findings
Enables effective control with keyframes, style images, sketches, and depth maps.
Reduces memory usage significantly with a simple latent processing method.
Produces high-quality, coherent videos across various tasks without training.
Abstract
Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, compatible with any video models.…
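The abstract names two ingredients: a latent processing step that cuts memory, and a latent optimization driven by a frame-level loss. A toy sketch of that idea follows, under loud assumptions. The linear causal `decode` function, the weights `W`, the step size `lr`, and the `guidance_step` helper are all hypothetical stand-ins, not the authors' implementation. The sketch also skips the diffusion model entirely and only shows how a loss on one decoded frame can be minimized by decoding a short slice of latents instead of the whole video.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, P = 16, 8, 32  # toy sizes: latent frames, latent channels, pixels per frame
W = rng.normal(size=(P, D)) / np.sqrt(D)  # hypothetical linear decoder weights


def decode(z):
    """Toy causal decoder (assumption): frame t depends only on latents t and t-1."""
    prev = np.vstack([np.zeros((1, z.shape[1])), z[:-1]])
    return (z + 0.5 * prev) @ W.T


def guidance_step(z, k, target, lr=0.02):
    """One gradient step on the latents that frame k depends on."""
    lo = max(k - 1, 0)
    frame = decode(z[lo:k + 1])[-1]   # decode the slice, not all T frames
    residual = frame - target
    g = 2.0 * W.T @ residual          # gradient w.r.t. (z[k] + 0.5 * z[k-1])
    z = z.copy()
    z[k] -= lr * g
    if k > 0:
        z[k - 1] -= lr * 0.5 * g
    return z, float(residual @ residual)


k = 5                                          # index of the guided keyframe
z0 = rng.normal(size=(T, D))                   # stand-in for the current latents
target = decode(rng.normal(size=(T, D)))[k]    # stand-in for the keyframe signal

z, losses = z0, []
for _ in range(50):
    z, loss = guidance_step(z, k, target)
    losses.append(loss)
```

In this toy, decoding the two-latent slice reproduces frame `k` exactly because the decoder is causal with a one-frame receptive field. A real video VAE has a wider temporal receptive field, so the slice width and any approximation error would need to be checked against the actual model.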
Peer Reviews
Decision: ICLR 2026 Poster
Among the strong points of the paper I would focus on the following ones:
1. Introduces a general-purpose, training-free framework for frame-level control, addressing a gap between task-specific training-free methods and general-purpose fine-tuning approaches.
2. Rigorous experiments across multiple VDMs (e.g., CogVideoX, Wan, SVD) and tasks, supported by human evaluations and metrics (FID, FVD, CLIP scores).
3. Latent slicing reduces GPU memory usage by up to 60×, enabling application to large…
The weak points are:
1. Guidance increases inference time by 2–4×, limiting real-time applicability.
2. Performance is constrained by the base VDM’s capabilities, especially for dynamic or fine-grained content.
3. While latent slicing is justified via experiments on CausalVAE, broader validation across architectures is limited.
4. Although multi-condition guidance is shown, combining losses for complex controls (e.g., motion + style) is not deeply explored.
5. As you know, one of the main proble…
1. Originality: While the paper builds on existing concepts like latent optimization, its integration into a training-free video generation framework with frame-level control is novel and practical.
2. Quality: The method is well-engineered and demonstrates compatibility with multiple video diffusion backbones. The results show improved coherence and controllability across tasks.
3. Clarity: The paper is generally well-written and structured. The methodology is explained clearly, and the figures…
1. Limited Novelty: The core techniques (e.g., latent optimization and latent slicing) are adaptations of existing methods. The novelty lies more in the integration and application than in the underlying algorithms. The proposed method did show improved performance, but this rests on its underlying video diffusion model.
2. The latent slicing strategy, while memory-efficient, may be sensitive to frame rate and motion magnitude. In cases of large motion or occlusion, it may fail to maintain coherence.
1. This paper proposes a training-free guidance method that eliminates the need for fine-tuning large-scale video models for controllable generation.
2. This paper introduces a memory-efficient latent processing technique and optimization strategy that enables practical application on large-scale models while ensuring temporal coherence.
3. This paper demonstrates versatility across diverse control tasks including keyframes, stylization, and looping, with compatibility across any video generat…
1. The authors mention in the limitations that "The computational cost of guidance sampling is higher than that of training-based methods." Please provide timing comparisons in the rebuttal to better benchmark the method against alternatives.
2. This paper focuses on training-free controllable video generation, but the related works section lacks discussion of previous relevant methods, such as Tune-A-Video[1], Text2Video-Zero[2], and ControlVideo[3].
3. Can the proposed method be extended to co…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques
