Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Nina Konovalova; Maxim Nikolaev; Andrey Kuznetsov; Aibek Alanov

arXiv:2507.02321·cs.CV·July 4, 2025

3 Reviews

TL;DR

InnerControl introduces a novel training strategy that enforces spatial consistency at all diffusion steps by reconstructing control signals from intermediate features, significantly enhancing control fidelity and image quality in text-to-image diffusion models.

Contribution

It proposes a new training approach that uses lightweight probes to enforce intermediate feature consistency, improving control accuracy over existing methods like ControlNet++.

Findings

01

Achieves state-of-the-art control fidelity across various conditioning methods.

02

Improves image generation quality by maintaining spatial consistency during diffusion.

03

Effectively reconstructs control signals from noisy latents at all steps.

Abstract

Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions…
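The probe-based alignment described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `ToyProbe` class, the 1x1-convolution design, the sinusoidal `timestep_embedding`, the additive timestep conditioning, and the use of mean squared error as the discrepancy are hypothetical stand-ins, and random arrays take the place of real UNet features. The sketch only evaluates the loss; in actual training its gradient would be used to update the conditioning module.

```python
import numpy as np


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (a common construction, assumed here)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


class ToyProbe:
    """Hypothetical timestep-conditioned probe: features (C, H, W) -> control (1, H, W)."""

    def __init__(self, in_channels, emb_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.emb_dim = emb_dim
        self.w_feat = rng.normal(scale=0.1, size=(1, in_channels))
        self.w_time = rng.normal(scale=0.1, size=(in_channels, emb_dim))

    def __call__(self, features, t):
        # Timestep conditioning: shift every channel by a projection of the embedding.
        shift = self.w_time @ timestep_embedding(t, self.emb_dim)  # shape (C,)
        conditioned = features + shift[:, None, None]
        # A 1x1 convolution is a linear map over the channel axis at each pixel.
        return np.einsum("oc,chw->ohw", self.w_feat, conditioned)


def alignment_loss(probe, features_per_step, timesteps, control):
    """MSE between the probe's prediction and the input control, averaged over steps."""
    losses = [
        np.mean((probe(f, t) - control) ** 2)
        for f, t in zip(features_per_step, timesteps)
    ]
    return float(np.mean(losses))


# Toy usage: random arrays stand in for intermediate features at three denoising steps.
rng = np.random.default_rng(1)
timesteps = [999, 500, 10]
features = [rng.normal(size=(4, 8, 8)) for _ in timesteps]
control = rng.uniform(size=(1, 8, 8))  # e.g. a normalized depth or edge map
probe = ToyProbe(in_channels=4)
print(alignment_loss(probe, features, timesteps, control))
```

The point of the sketch is that the loss is computed at every sampled timestep, including early, noisy ones, rather than only near the end of denoising.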

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

- This paper observes that prior approaches only focus on the final generation results, leading to slow and delayed feedback during training.
- To address this, the paper introduces the prediction of a pseudo $x_0$, which is then decoded into an image via a VAE. This enables the extension of the consistency loss to every diffusion step.

Weaknesses

* The method is expensive because a VAE decode is required at every step.
* The contribution over ControlNet++ is only the step-level feedback, which has been a trivial trick since 2023.
* The improvement is marginal.

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper's core strength is its clear identification and experimental validation of why prior reward losses (like ControlNet++) fail. The analysis showing that the signal source (the blurry one-step $x_0'$ prediction) is the root cause of early-step instability is a critical insight.
2. The idea of using the UNet's "inner voice" (intermediate features) is an elegant solution. The pre-trained probe ($\mathbb{H}$) is shown to be a far more robust signal extractor than standard models (like DPT

Weaknesses

1. The proposed method requires a new, non-trivial pre-training stage for the $\mathbb{H}$ probe. This must be done for every control type (depth, segmentation, HED, etc.), and each probe requires its own specific training dataset (e.g., ADE20K for segmentation). This increases the overall pipeline complexity and data requirements. The paper could be strengthened by discussing the generalization of these probes or a more data-efficient way to train them.
2. The final loss function is a complex c

Reviewer 03·Rating 4·Confidence 3

Strengths

The paper’s core contribution—leveraging intermediate diffusion features for control alignment across all denoising steps—is both novel and timely. While prior works like ControlNet++ and CTRL-U focused on late-stage alignment via reward losses, this work identifies and addresses a critical temporal gap: the early denoising stages, where spatial structure emerges but is overlooked. The idea of training lightweight, timestep-conditioned probes to extract control signals from noisy intermediate fe

Weaknesses

The alignment module is essentially a “conv head on U-Net decoder” whose structure and channel-fusion logic are borrowed wholesale from Readout Guidance (Readout Guidance: Learning Control from Diffusion Features). The only new twist is timestep conditioning, which is already standard in diffusion literature. Consequently, the architectural contribution feels thin.

All four tested conditions (depth, HED, LineArt, segmentation) are dense, pixel-to-pixel maps. The paper never tackles sparse or
