D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
Nobline Yoo, Olga Russakovsky, Ye Zhu

TL;DR
This paper introduces D2D, a framework that converts non-differentiable object detectors into differentiable critics to improve numeracy in text-to-image models, significantly enhancing counting accuracy without sacrificing quality.
Contribution
The paper presents a novel method to leverage detector-based models as differentiable critics, enabling better object counting guidance in text-to-image generation.
Findings
Up to 13.7% improvement in counting accuracy on low-density benchmarks.
Consistent performance gains across multiple datasets and scenarios.
Minimal impact on image quality and computational efficiency.
Abstract
Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to…
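To make the core idea concrete, here is a minimal sketch of how a count-by-enumeration step can be relaxed into something differentiable. It is not the paper's formulation: the abstract is truncated before the activation functions are defined, and the reviews only mention "steep sigmoids and logit scaling". The function names (`soft_count`, `count_loss_and_grad`), the `threshold` and `steepness` parameters, and the squared-error loss are all assumptions chosen for illustration. The gradient is written out by hand so that the example needs only the standard library.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_count(logits, threshold=0.5):
    # Count-by-enumeration: a step function of the confidences, so its
    # gradient is zero almost everywhere and useless for guidance.
    return sum(1 for z in logits if sigmoid(z) > threshold)


def soft_count(logits, threshold=0.5, steepness=50.0):
    # Hypothetical relaxation: replace the step with a steep sigmoid
    # centred on the confidence threshold.
    return sum(sigmoid(steepness * (sigmoid(z) - threshold)) for z in logits)


def count_loss_and_grad(logits, target, threshold=0.5, steepness=50.0):
    # Squared count error and its gradient w.r.t. each detection logit.
    diff = soft_count(logits, threshold, steepness) - target
    grads = []
    for z in logits:
        p = sigmoid(z)                                # detection confidence
        g = sigmoid(steepness * (p - threshold))      # soft "is detected"
        # Chain rule: dg/dz = g(1-g) * steepness * p(1-p)
        grads.append(2.0 * diff * g * (1.0 - g) * steepness * p * (1.0 - p))
    return diff * diff, grads


# Made-up detection logits: two confident boxes, one borderline, two rejected.
logits = [4.0, 3.0, 0.2, -3.0, -5.0]
loss, grads = count_loss_and_grad(logits, target=2)
print(hard_count(logits), round(soft_count(logits), 3), round(loss, 3))
```

With a target of 2, the largest gradient in this toy setup falls on the borderline detection, which is the one a small change could push below the threshold. The confident and rejected boxes sit on the flat parts of the steep sigmoid and receive almost no gradient. In an actual pipeline the logits would come from a detector and the gradient would be propagated further back by an autodiff framework.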
Peer Reviews
Decision: Submitted to ICLR 2026
- **Novel reward function design:** Converting non-differentiable detectors into differentiable counting rewards through steep sigmoids and logit scaling (Eq. 1-2) addresses a limitation where detectors outperform regression models in low-density counting but couldn't previously be used for gradient-based optimization.
- **Two complementary contributions:** The differentiable detector-based reward ($\mathcal{L}_{D2D}$) and the LMN architecture for reward optimization. Table 5 shows the LMN impro
- **Limited Discussion of General-Purpose Application**: The methodology is tailored for prompts containing numerical targets, but the paper does not discuss how the framework should behave with general, non-numeric prompts. It is unclear whether the D2D optimization is intended to be selectively activated, or what the default behavior should be in the absence of a numerical target (e.g., "few", "some", or "many"). It is also unclear how the method handles very long, complex prompts with lots of content. Additionally, it's not r
1. The proposed method is logical and well-motivated, with clear explanations that make the underlying rationale and implementation easy to understand.
2. The analysis of the proposed framework is in-depth: for instance, the comparison between detector-based and regression-based counting models (Figure 2), the exploration of class-wise performance (Figure 4), and the ablation between direct noise optimization and LMN-based methods (Table 5).
3. The demonstrated generality of the proposed metho
1. The evaluation relies solely on a detector-based protocol to assess the proposed detector-based critic. As noted around line 315, the paper employs the SOTA counting model CountGD (built upon GroundingDINO). Although the proposed critic uses different detectors such as OWL-ViT or YOLO for initial noise optimization, this setup risks giving an unfair advantage aligned with the evaluation criterion. To more robustly validate the superiority of the method, additional evaluation metrics, such as
The biggest strength of the paper is in constructing an effective objective for differentiable optimization. This ensures that there's effective inference-time optimization, which is visible from the strong empirical results on several counting benchmarks. The paper is also well-presented and easy to follow.
Avoids more recent larger models: The results in the paper are on the SD-Turbo, SDXL-Turbo and Pixart-alpha models which are all not only distilled (therefore a bit worse in performance), but also fairly out of date as of late 2025 (all being released between late 2023-early 2024). While some of the newer models may not be applicable, one could still see results on Flux-Schnell, SANA-Sprint, SD3.5-Turbo to see how effective this optimization framework is on newer problems. While I'd expect the c
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Cell Image Analysis Techniques
