EVCtrl: Efficient Control Adapter for Visual Generation
Zixiang Yang, Yue Ma, Yinhan Zhang, Shanhui Mo, Dongrui Liu, Linfeng Zhang

TL;DR
EVCtrl is a lightweight, plug-and-play control adapter that significantly improves the efficiency of visual generation models by reducing redundant computation and denoising steps without retraining, enabling faster image and video control generation.
Contribution
We introduce EVCtrl, a novel spatio-temporal dual caching strategy that reduces computational overhead in controllable visual generation models without requiring retraining.
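As a rough illustration of the temporal half of such a caching strategy, the sketch below calls a control branch only at a set of "critical" denoising steps and reuses the cached output in between. The function names, the toy branch, and the evenly spaced schedule are all hypothetical choices made for this example; the page does not say how the paper actually selects its critical steps.

```python
from typing import Callable, List, Set, Tuple


def run_with_step_cache(
    num_steps: int,
    critical_steps: Set[int],
    control_branch: Callable[[int], float],
) -> Tuple[List[float], int]:
    """Return per-step control outputs and the number of real branch calls."""
    outputs: List[float] = []
    cached = None
    calls = 0
    for t in range(num_steps):
        # Recompute on the first step and on every critical step;
        # otherwise reuse the most recent cached output.
        if cached is None or t in critical_steps:
            cached = control_branch(t)
            calls += 1
        outputs.append(cached)
    return outputs, calls


def toy_control_branch(t: int) -> float:
    """Stand-in for an expensive control branch (hypothetical)."""
    return 0.5 * t


# Hypothetical schedule: every fourth step is treated as critical.
critical = set(range(0, 20, 4))
outputs, calls = run_with_step_cache(20, critical, toy_control_branch)
```

Here the branch runs 5 times instead of 20. How much quality is lost depends on how fast the real control signal changes between critical steps, which this toy example does not model.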
Findings
Achieves over 2x speedup on benchmark models
Maintains high quality in image and video generation
Reduces redundant computation in control regions
Abstract
Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. Although ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained…
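The spatial side of caching for sparse control information can be sketched in a similar spirit. The example below ranks control tokens by L1 norm, recomputes only the highest-magnitude ones, and reuses cached outputs for the rest. The helper names, the `keep_ratio` parameter, and the per-token toy layer are assumptions made for illustration, not the paper's implementation. A real DiT block also mixes tokens through attention, so reuse there would only be approximate.

```python
from typing import Callable, List

Token = List[float]


def l1_norm(token: Token) -> float:
    return sum(abs(x) for x in token)


def select_tokens_by_l1(tokens: List[Token], keep_ratio: float) -> List[int]:
    """Indices of the tokens with the largest L1 norm (hypothetical criterion)."""
    k = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: l1_norm(tokens[i]), reverse=True)
    return sorted(ranked[:k])


def cached_layer(
    tokens: List[Token],
    cache: List[Token],
    keep_idx: List[int],
    layer_fn: Callable[[Token], Token],
) -> List[Token]:
    """Recompute only the selected tokens; reuse cached outputs elsewhere."""
    out = [list(row) for row in cache]
    for i in keep_idx:
        out[i] = layer_fn(tokens[i])
    return out


def toy_layer(token: Token) -> Token:
    """Stand-in for a per-token layer such as an MLP (hypothetical)."""
    return [2.0 * x for x in token]


# Sparse control signal: only tokens 2 and 5 carry information.
tokens = [[0.0] * 4 for _ in range(8)]
tokens[2] = [1.0] * 4
tokens[5] = [2.0] * 4
cache = [toy_layer(t) for t in tokens]  # outputs from the previous step

# At the next step only an informative token changes.
new_tokens = [list(t) for t in tokens]
new_tokens[2] = [1.5] * 4

keep_idx = select_tokens_by_l1(new_tokens, keep_ratio=0.25)
approx = cached_layer(new_tokens, cache, keep_idx, toy_layer)
exact = [toy_layer(t) for t in new_tokens]
```

In this toy case the cached result equals the full recomputation because every token that changed is also a high-magnitude token. With a dense control signal that would no longer hold.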
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
**Strengths:**
- Proposes a **clear and practical approach** that effectively targets spatial and temporal redundancy in controllable diffusion models.
- **Training-free and plug-and-play**, making it highly practical for real-world deployment and easily integrated into existing ControlNet and DiT pipelines.
- **Comprehensive experiments** across text-to-image and text-to-video tasks with multiple baselines (FORA, ToCa, TaylorSeer) strongly support the claimed acceleration and quality pres…
**Weaknesses:**
1. The paper’s novelty is **incremental**, mainly improving upon existing caching or skipping techniques (e.g., ToCa, DuCa) rather than proposing a fundamentally new principle.
2. The description of **critical-step detection** in DSS lacks clarity; how “critical timesteps” are selected or tuned is underexplained.
3. Most experiments emphasize efficiency, but the **effect on temporal consistency and perceptual coherence** (especially in long video sequences) is not suff…
1. The motivation of this paper is clear, i.e., addressing the temporal and spatial redundancy caused by utilizing ControlNet.
2. The method is training-free and easy to implement, requiring no retraining. The LFoC component attempts to explore the internal working principles of DiT-ControlNet by analyzing the functions of different layers via L1 norm, which is a promising approach to optimize ControlNet.
3. Extensive experiments have been conducted, both qualitatively and quantitatively, consis…
1. The DSS (temporal acceleration) mechanism is insufficiently explained and lacks details that are crucial for reproducibility. Terms like "identified a priori" and "predetermined sequence of critical steps" are used without explaining the specific criteria for screening these steps or specifying the quantity $m$. This missing information is vital.
2. The methodology is not very novel. The core idea of DSS is similar to prior work like FORA, where FORA has already verified the effectiveness of exploiting t…
1. High practicality: zero training cost, plug-and-play deployment.
2. Clear redundancy modeling: separates spatial (LFoC) and temporal (DSS) redundancy.
3. Comprehensive experiments: multiple models and control conditions, significant acceleration.
4. Quality preservation: maintains visual metrics comparable to the original ControlNet under a 2.16× speedup.
1. Limited generalization: no tests on high-resolution, long-video, or complex control scenarios.
2. Narrow comparison: only evaluated against training-free acceleration baselines; insufficient comparison with other types of controllers or acceleration methods.
3. Shallow theoretical analysis: lacks formal discussion on why and when LFoC and DSS work or fail.
- The method achieves higher speedup and better metrics than the chosen baselines.
- The few qualitative results shown include cases where the method succeeds while the baselines fail.
- The analysis of token magnitudes, which correlates high-magnitude tokens with contour regions, is interesting.
- The qualitative video results are limited to 6 samples in the supplementary PowerPoint presentation, making it difficult to judge the quality of the method.
- The method relies on "Local Focused Caching" (LFoC) to achieve its speedup and performance. LFoC relies on the assumption that control signals are sparse and consist mostly of spatially uninformative regions (e.g. poses, edges). Usage of dense control signals such as depth maps should be…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques
