TL;DR
InnerControl introduces a novel training strategy that enforces spatial consistency at all diffusion steps by reconstructing control signals from intermediate features, significantly enhancing control fidelity and image quality in text-to-image diffusion models.
Contribution
It proposes a new training approach that uses lightweight probes to enforce intermediate feature consistency, improving control accuracy over existing methods like ControlNet++.
Findings
Achieves state-of-the-art control fidelity across various conditioning methods.
Improves image generation quality by maintaining spatial consistency during diffusion.
Effectively reconstructs control signals from noisy latents at all steps.
Abstract
Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions…
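The probe idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: it assumes a probe reduced to a 1x1 convolution (a per-pixel linear map over channels) with a timestep-dependent bias, and it assumes mean squared error as the "discrepancy". The names `PixelwiseProbe`, `alignment_loss`, and `all_step_alignment_loss` are hypothetical, and the feature maps are hand-written stand-ins for intermediate UNet features.

```python
import math


def timestep_embedding(t, dim=4, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep (sin terms, then cos terms)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


class PixelwiseProbe:
    """Hypothetical probe: 1x1 conv over channels plus a timestep-dependent bias."""

    def __init__(self, weights, time_weights, bias=0.0):
        self.weights = weights            # one weight per feature channel
        self.time_weights = time_weights  # one weight per embedding dimension
        self.bias = bias

    def __call__(self, features, t):
        # features: list of C channels, each an H x W nested list of floats
        emb = timestep_embedding(t, dim=len(self.time_weights))
        t_bias = sum(w * e for w, e in zip(self.time_weights, emb))
        height, width = len(features[0]), len(features[0][0])
        return [
            [
                sum(w * ch[y][x] for w, ch in zip(self.weights, features))
                + t_bias
                + self.bias
                for x in range(width)
            ]
            for y in range(height)
        ]


def alignment_loss(pred, target):
    """Mean squared error between predicted and target control maps (assumed loss)."""
    n = len(target) * len(target[0])
    return sum(
        (p - q) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, q in zip(pred_row, target_row)
    ) / n


def all_step_alignment_loss(probe, features_per_step, target):
    """Average the alignment loss over every sampled denoising step."""
    losses = [alignment_loss(probe(f, t), target) for t, f in features_per_step.items()]
    return sum(losses) / len(losses)


# Toy data: 2 feature channels on a 2x2 grid, and a 2x2 target control map.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]]
target = [[0.5, 2.0], [1.5, 2.0]]
probe = PixelwiseProbe(weights=[0.5, 1.0], time_weights=[0.0, 0.0, 0.0, 0.0])

# In practice the features would differ per step; the toy reuses one map.
features_per_step = {900: features, 500: features, 100: features}
loss = all_step_alignment_loss(probe, features_per_step, target)
print(f"alignment loss averaged over steps: {loss}")
```

The point of the sketch is only the shape of the computation: the same probe is queried at every timestep, and the loss is averaged over all of them rather than restricted to the final denoising steps.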
Peer Reviews
Decision: Submitted to ICLR 2026
- This paper observes that prior approaches focus only on the final generation results, leading to slow and delayed feedback during training.
- To address this, the paper introduces the prediction of a pseudo $x_0$, which is then decoded into an image via a VAE. This enables the extension of the consistency loss to every diffusion step.
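For readers unfamiliar with the pseudo $x_0$ mentioned above, here is a minimal sketch of a one-step clean-sample estimate. It assumes the standard DDPM noise-prediction parameterization, which the reviews do not spell out. The helper names, the toy 8-element "latent", and the constant 0.1 offset in the noise estimate are all hypothetical choices for illustration.

```python
import math
import random


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """One-step estimate of x0, obtained by inverting the forward equation."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(xt - b * e) / a for xt, e in zip(x_t, eps_pred)]


def mean_abs_error(u, v):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(p - q) for p, q in zip(u, v)) / len(u)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # toy "clean latent"
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]     # true noise

# Hypothetical imperfect noise estimate: the true noise plus a constant offset.
eps_pred = [e + 0.1 for e in eps]

errors = {}
for alpha_bar_t in (0.99, 0.5, 0.01):  # low noise -> high noise
    x_t = add_noise(x0, eps, alpha_bar_t)
    errors[alpha_bar_t] = mean_abs_error(x0, predict_x0(x_t, eps_pred, alpha_bar_t))
    print(f"alpha_bar_t={alpha_bar_t}: x0 error = {errors[alpha_bar_t]:.3f}")
```

Under these assumptions, a fixed error in the predicted noise is scaled by $\sqrt{1-\bar\alpha_t}/\sqrt{\bar\alpha_t}$ in the $x_0$ estimate, so the estimate degrades as the noise level rises. This is one plausible reading of why a one-step prediction is an unreliable signal at early, high-noise steps.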
* The cost is high because a VAE decode is required at every step.
* The contribution over ControlNet++ is only the step-level feedback, which has been a trivial trick since 2023.
* The improvement is marginal.
1. The paper's core strength is its clear identification and experimental validation of why prior reward losses (like ControlNet++) fail. The analysis showing that the signal source (the blurry one-step $x_0'$ prediction) is the root cause of early-step instability is a critical insight.
2. The idea of using the UNet's "inner voice" (intermediate features) is an elegant solution. The pre-trained probe ($\mathbb{H}$) is shown to be a far more robust signal extractor than standard models (like DPT
1. The proposed method requires a new, non-trivial pre-training stage for the $\mathbb{H}$ probe. This must be done for every control type (depth, segmentation, HED, etc.), and each probe requires its own specific training dataset (e.g., ADE20K for segmentation). This increases the overall pipeline complexity and data requirements. The paper could be strengthened by discussing the generalization of these probes or a more data-efficient way to train them.
2. The final loss function is a complex c
The paper’s core contribution—leveraging intermediate diffusion features for control alignment across all denoising steps—is both novel and timely. While prior works like ControlNet++ and CTRL-U focused on late-stage alignment via reward losses, this work identifies and addresses a critical temporal gap: the early denoising stages, where spatial structure emerges but is overlooked. The idea of training lightweight, timestep-conditioned probes to extract control signals from noisy intermediate fe
The alignment module is essentially a “conv head on U-Net decoder” whose structure and channel-fusion logic are borrowed wholesale from Readout Guidance (Readout Guidance: Learning Control from Diffusion Features). The only new twist is timestep conditioning, which is already standard in diffusion literature. Consequently, the architectural contribution feels thin.
All four tested conditions (depth, HED, LineArt, segmentation) are dense, pixel-to-pixel maps. The paper never tackles sparse or
