OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration
Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun

TL;DR
OUSAC is a novel framework that accelerates diffusion transformer image generation by optimizing guidance scheduling and adaptive caching, reducing computation significantly while maintaining or improving image quality.
Contribution
It introduces a two-stage optimization approach combining evolutionary algorithms and adaptive rank allocation to effectively skip unnecessary computations in diffusion models.
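To make the idea of a gradient-free evolutionary search over guidance schedules concrete, here is a toy sketch. It is not the paper's algorithm or objective. The cost function `toy_cost`, the mutation rule, the population size, and the reference scale of 4.0 are all assumptions chosen so the example runs without a diffusion model. A real search would score each candidate by generating images and measuring quality.

```python
import random
from typing import List, Optional, Tuple

# A schedule holds one entry per sampling step:
# a float is the guidance scale, None means "skip CFG at this step".
Schedule = List[Optional[float]]


def toy_cost(schedule: Schedule, ref_scale: float = 4.0, penalty: float = 0.5) -> float:
    """Hypothetical proxy objective (NOT the paper's): keep the mean effective
    guidance scale near ref_scale while paying a penalty per CFG step."""
    # A skipped step behaves like scale 1.0 (conditional prediction only).
    effective = [1.0 if s is None else s for s in schedule]
    deviation = abs(sum(effective) / len(effective) - ref_scale)
    cfg_fraction = sum(s is not None for s in schedule) / len(schedule)
    return deviation + penalty * cfg_fraction


def mutate(schedule: Schedule, rng: random.Random, max_scale: float = 10.0) -> Schedule:
    """Either toggle CFG on/off at one step (discrete move) or nudge its scale (continuous move)."""
    child = list(schedule)
    i = rng.randrange(len(child))
    if rng.random() < 0.5:
        child[i] = None if child[i] is not None else rng.uniform(1.0, max_scale)
    elif child[i] is not None:
        child[i] = min(max_scale, max(1.0, child[i] + rng.gauss(0.0, 0.5)))
    return child


def evolve(num_steps: int = 10, pop_size: int = 16, generations: int = 15,
           seed: int = 0) -> Tuple[Schedule, List[float]]:
    rng = random.Random(seed)
    # Start from full CFG at a constant scale on every step.
    population: List[Schedule] = [[4.0] * num_steps for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        population.sort(key=toy_cost)
        history.append(toy_cost(population[0]))
        elite = population[: pop_size // 4]  # elitism: the best candidates survive unchanged
        children = [mutate(rng.choice(elite), rng) for _ in range(pop_size - len(elite))]
        population = elite + children
    return min(population, key=toy_cost), history


best, history = evolve()
print("CFG steps kept:", sum(s is not None for s in best), "of", len(best))
print("best cost:", toy_cost(best))
```

Because the elite are carried over unchanged, the best cost never increases from one generation to the next. The search space mixes a discrete choice (apply CFG or not) with a continuous one (the scale), which is why a gradient-free method is a natural fit here.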
Findings
Achieves up to 82% reduction in unconditional passes.
Delivers 53% computational savings with 15% quality improvement on DiT-XL/2.
Provides 5x speedup with better CLIP scores on benchmark datasets.
Abstract
Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which…
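The cost of CFG and the effect of skipping it can be illustrated with a minimal sketch. It uses the standard CFG combination rule. The `toy_model` denoiser, the placeholder update step, and the example schedule are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# Hypothetical denoiser signature: (latent, timestep, use_condition) -> predicted noise
Denoiser = Callable[[Vector, int, bool], Vector]


def cfg_combine(eps_cond: Vector, eps_uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def run_sparse_cfg(model: Denoiser, x: Vector,
                   schedule: Sequence[Optional[float]]) -> Tuple[Vector, int]:
    """Run one denoising loop. schedule[t] is the guidance scale at step t,
    or None to skip the unconditional pass. Returns (final latent, forward passes)."""
    passes = 0
    for t, scale in enumerate(schedule):
        eps_cond = model(x, t, True)
        passes += 1
        if scale is None:
            eps = eps_cond  # conditional prediction only: one forward pass
        else:
            eps_uncond = model(x, t, False)
            passes += 1
            eps = cfg_combine(eps_cond, eps_uncond, scale)
        # Placeholder update rule standing in for a real sampler step (assumption)
        x = [xi - 0.1 * ei for xi, ei in zip(x, eps)]
    return x, passes


def toy_model(x: Vector, t: int, use_condition: bool) -> Vector:
    """Toy stand-in for a diffusion transformer; not a real network."""
    shift = 1.0 if use_condition else 0.0
    return [0.5 * xi + shift for xi in x]


schedule = [4.0, None, None, 1.5, None]  # hypothetical variable-scale, sparse schedule
_, sparse_passes = run_sparse_cfg(toy_model, [0.0, 1.0], schedule)
_, full_passes = run_sparse_cfg(toy_model, [0.0, 1.0], [4.0] * 5)
print(sparse_passes, full_passes)  # 7 10
```

Counting forward passes only, a run of T steps with full CFG costs 2T passes, and skipping a fraction r of the unconditional passes leaves (2 - r)T. This count ignores any reduction in the number of sampling steps and any savings from caching, so it is not meant to reproduce the figures reported above.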
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is clearly written and the motivation is easy to follow.
2. Integrating guidance scheduling and caching is a practical and coherent idea.
3. Results on DiT and PixArt-α show consistent speedup with comparable quality.
4. The approach is training-free and could be added to existing DiT models.
1. Limited relevance: The method assumes Classifier-Free Guidance, but many recent models (e.g., FLUX) do not use CFG at all. The paper does not explain how OUSAC would work in those settings.
2. Missing search-cost details: Stage-1 evolutionary search likely requires many full generations, but the paper gives no runtime, GPU hours, or total evaluations.
3. Generalization claims unproven: The paper claims that the discovered guidance schedule “generalizes across different prompts and condition
1. **Practical, training-free approach.** The method operates post-hoc on pretrained DiT models (no extra model training), which is attractive for adoption. The two stages are performed offline once per model and then used at inference.
2. **Diverse experiments.** Evaluation on large, realistic models/datasets (DiT-XL/2 on ImageNet 256/512; PixArt-α on MSCOCO) with standard metrics (FID, IS, sFID, CLIP score) and multiple baselines (ICC, L2C, Harmonica, DDIM) gives the paper empirical brea
- **Missing/insufficient reporting of optimization cost & reproducibility concerns.** The method’s Stage-1 uses evolutionary optimization (population sampling, multiple generations) to search a T-dimensional schedule; Stage-2 uses coordinate descent over region ranks. The paper gives search hyperparameters (e.g., 15 generations for DiT) but does not report the wall-clock compute, GPU hours, or search cost needed to discover schedules and rank assignments nor whether the search is practical for l
1. OUSAC is the first to recognize and systematically handle the interdependence between variable guidance and cache calibration. The use of an evolutionary optimization framework for discovering sparse guidance schedules is particularly original, allowing training-free, gradient-free optimization over a hybrid discrete–continuous search space.
2. The methodological formulation is technically solid and well-motivated. Stage-1’s optimization objective is rigorously defined, balancing fidelity an
1. Limited evaluation diversity and generalization scope. Although the paper demonstrates consistent improvements on DiT-XL/2 (ImageNet) and PixArt-α (MSCOCO), both models share a diffusion-transformer backbone and similar sampling dynamics. It remains unclear whether OUSAC’s sparse scheduling and adaptive caching generalize to recent state-of-the-art text-to-image models, such as FLUX.1-dev, Stable Diffusion 3.5 (SD3), or Qwen-Image, which feature substantially different architectures, noise s
Taxonomy
Topics: Image Enhancement Techniques · Image and Video Quality Assessment · Advanced Data Compression Techniques
