EVCtrl: Efficient Control Adapter for Visual Generation

Zixiang Yang; Yue Ma; Yinhan Zhang; Shanhui Mo; Dongrui Liu; Linfeng Zhang

arXiv:2508.10963·cs.CV·December 9, 2025


Open Access · 4 Reviews

TL;DR

EVCtrl is a lightweight, plug-and-play control adapter that speeds up controllable image and video generation by cutting redundant computation across spatial regions and denoising steps, without retraining the model.

Contribution

We introduce EVCtrl, a novel spatio-temporal dual caching strategy that reduces computational overhead in controllable visual generation models without requiring retraining.

Findings

1. Achieves over 2x speedup on benchmark models
2. Maintains high quality in image and video generation
3. Reduces redundant computation in control regions

Abstract

Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. Although ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained…
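The temporal half of the caching idea described in the abstract can be illustrated with a minimal sketch. It assumes only what the abstract and reviews state: the control branch is recomputed at a predetermined set of critical denoising steps and its cached output is reused at the remaining steps. The function names, the toy branch, and the chosen step set are hypothetical and are not taken from the paper, whose actual step-selection criteria are not given on this page.

```python
from typing import Callable, List, Set, Tuple


def run_with_step_cache(
    num_steps: int,
    critical_steps: Set[int],
    control_branch: Callable[[int], List[float]],
) -> Tuple[List[List[float]], int]:
    """Recompute the control branch only at critical steps; reuse the cache elsewhere.

    `critical_steps` is an assumed, hand-picked set; how such steps are
    actually chosen is not specified in the text above.
    """
    cache = None
    outputs = []
    recomputed = 0
    for step in range(num_steps):
        # Always compute on the first step so the cache is never empty.
        if cache is None or step in critical_steps:
            cache = control_branch(step)
            recomputed += 1
        outputs.append(cache)
    return outputs, recomputed


def toy_branch(step: int) -> List[float]:
    """Stand-in for an expensive ControlNet forward pass (hypothetical)."""
    return [float(step), float(step) * 0.5]


outputs, recomputed = run_with_step_cache(10, {0, 3, 7}, toy_branch)
print(recomputed)   # number of full control-branch evaluations
print(outputs[5])   # features reused from the most recent critical step
```

In this toy run the branch is evaluated 3 times instead of 10, and step 5 reuses the features computed at step 3. The saving in a real model depends on how many steps can be skipped without hurting quality, which this sketch does not model.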

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Proposes a **clear and practical approach** that effectively targets spatial and temporal redundancy in controllable diffusion models.
- **Training-free and plug-and-play**, making it highly practical for real-world deployment and easily integrated into existing ControlNet and DiT pipelines.
- **Comprehensive experiments** across text-to-image and text-to-video tasks with multiple baselines (FORA, ToCa, TaylorSeer) strongly support the claimed acceleration and quality pres

Weaknesses

1. The paper’s novelty is **incremental**, mainly improving upon existing caching or skipping techniques (e.g., ToCa, DuCa) rather than proposing a fundamentally new principle.
2. The description of **critical-step detection** in DSS lacks clarity; how “critical timesteps” are selected or tuned is underexplained.
3. Most experiments emphasize efficiency, but the **effect on temporal consistency and perceptual coherence** (especially in long video sequences) is not suff

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The motivation of this paper is clear, i.e., addressing the temporal and spatial redundancy caused by utilizing ControlNet.
2. The method is training-free and easy to implement, requiring no retraining. The LFoC component attempts to explore the internal working principles of DiT-ControlNet by analyzing the functions of different layers via L1 norm, which is a promising approach to optimize ControlNet.
3. Extensive experiments have been conducted, both qualitatively and quantitatively, consis

Weaknesses

1. The DSS (temporal acceleration) mechanism is insufficiently explained and lacks details crucial for reproducibility. Terms like "identified a priori" and "predetermined sequence of critical steps" are used without explaining the specific criteria for screening these steps or specifying the quantity $m$. This missing information is vital.
2. The methodology is not very novel. The core idea of DSS is similar to prior work like FORA, where FORA has already verified the effectiveness of exploiting t

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. High practicality: zero training cost, plug-and-play deployment.
2. Clear redundancy modeling: separates spatial (LFoC) and temporal (DSS) redundancy.
3. Comprehensive experiments: multiple models and control conditions, significant acceleration.
4. Quality preservation: maintains visual metrics comparable to the original ControlNet under 2.16 times speed-up.

Weaknesses

1. Limited generalization: no tests on high-resolution, long-video, or complex control scenarios.
2. Narrow comparison: only evaluated against training-free acceleration baselines; insufficient comparison with other types of controllers or acceleration methods.
3. Shallow theoretical analysis: lacks formal discussion on why and when LFoC and DSS work or fail.

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- The method achieves higher speedup and better metrics with respect to the chosen baselines.
- The small amount of showcased qualitative results shows some cases where the method is successful while baselines are failing.
- The analysis of token magnitudes, correlating high-magnitude tokens to contour regions, is interesting.
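The token-magnitude observation in the last point can be made concrete with a small sketch. It assumes that tokens with a large L1 norm are the ones worth refreshing, while the remaining tokens reuse cached values. The helper names, the `keep_ratio` parameter, and the toy data are hypothetical, and this is not the paper's LFoC implementation.

```python
from typing import List, Sequence


def l1_norm(token: Sequence[float]) -> float:
    """L1 norm of a single token vector."""
    return sum(abs(x) for x in token)


def select_high_magnitude_tokens(
    tokens: Sequence[Sequence[float]], keep_ratio: float
) -> List[int]:
    """Return indices of the top `keep_ratio` fraction of tokens by L1 norm."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(
        range(len(tokens)), key=lambda i: l1_norm(tokens[i]), reverse=True
    )
    return sorted(ranked[:k])


def refresh_selected(
    cached: Sequence[Sequence[float]],
    fresh: Sequence[Sequence[float]],
    selected: Sequence[int],
) -> List[Sequence[float]]:
    """Overwrite only the selected token positions; keep cached values elsewhere."""
    updated = list(cached)
    for i in selected:
        updated[i] = fresh[i]
    return updated


# Toy control features: tokens 1 and 3 mimic high-magnitude "contour" tokens.
fresh_tokens = [[0.0, 0.1], [2.0, -1.5], [0.05, 0.0], [-3.0, 0.5]]
cached_tokens = [[9.0, 9.0]] * 4

selected = select_high_magnitude_tokens(fresh_tokens, keep_ratio=0.5)
mixed = refresh_selected(cached_tokens, fresh_tokens, selected)
print(selected)
print(mixed)
```

Here only tokens 1 and 3 are refreshed, and the low-magnitude tokens keep their cached values. A dense control signal, where most tokens have comparable magnitude, would leave little to skip under this kind of selection.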

Weaknesses

- The qualitative video results are limited to 6 samples in the supplementary PowerPoint presentation, making it difficult to judge the quality of the method qualitatively.
- The method relies on "Local Focused Caching" (LFoC) to achieve its speedup and performance. LFoC relies on the assumption that control signals are sparse and consist mostly of spatially uninformative regions (e.g. poses, edges). Usage of dense control signals such as depth maps should be

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques