GenDet: Painting Colored Bounding Boxes on Images via Diffusion Model for Object Detection
Chen Min, Chengyang Li, Fanjie Kong, Qi Zhu, Dawei Zhao, Liang Xiao

TL;DR
GenDet introduces a novel object detection framework that formulates detection as an image generation task using a diffusion model, enabling precise bounding box and category prediction within a generative paradigm.
Contribution
It pioneers the use of large-scale pre-trained diffusion models for object detection, integrating semantic constraints for accurate bounding box generation in the image space.
Findings
Achieves accuracy competitive with traditional detectors
Provides flexible and controllable detection outputs
Bridges generative models with discriminative detection tasks
Abstract
This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional approaches, GenDet relies on generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet establishes a conditional generation architecture built upon the large-scale pre-trained Stable Diffusion model, formulating the detection task as semantic constraints within the latent space. It enables precise control over bounding box positions and category attributes while preserving the flexibility of the generative model. This methodology bridges the gap between generative models and discriminative tasks, providing a fresh perspective for constructing unified visual understanding systems. Systematic experiments…
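To make the idea of expressing detections in image space concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical class-to-color palette (`PALETTE`) and hypothetical helpers `paint_boxes` and `decode_boxes`: the first paints each box outline in its class color to form the kind of target image a conditional generator could be trained to produce, and the second reads boxes back from exactly matching pixels. The abstract does not specify the palette, the outline thickness, or the decoding procedure, so all of these are illustrative assumptions.

```python
import numpy as np

# Hypothetical class-to-color palette (RGB); not taken from the paper.
PALETTE = {"person": (255, 0, 0), "car": (0, 255, 0), "dog": (0, 0, 255)}


def paint_boxes(image, boxes, thickness=2):
    """Return a copy of `image` with each box outline drawn in its class color.

    boxes: list of (label, x0, y0, x1, y1) with inclusive pixel coordinates.
    """
    target = image.copy()
    t = thickness
    for label, x0, y0, x1, y1 in boxes:
        color = np.array(PALETTE[label], dtype=image.dtype)
        target[y0:y0 + t, x0:x1 + 1] = color          # top edge
        target[y1 - t + 1:y1 + 1, x0:x1 + 1] = color  # bottom edge
        target[y0:y1 + 1, x0:x0 + t] = color          # left edge
        target[y0:y1 + 1, x1 - t + 1:x1 + 1] = color  # right edge
    return target


def decode_boxes(painted):
    """Recover at most one box per palette color from exactly matching pixels.

    Simplification: assumes one instance per class and noise-free colors;
    a generated image would need color tolerance and instance separation.
    """
    found = []
    for label, color in PALETTE.items():
        mask = np.all(painted == np.array(color, dtype=painted.dtype), axis=-1)
        if mask.any():
            ys, xs = np.nonzero(mask)
            found.append(
                (label, int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
            )
    return found


# Usage: a flat gray image stands in for a real photograph.
image = np.full((64, 64, 3), 128, dtype=np.uint8)
boxes = [("person", 5, 8, 30, 40), ("car", 35, 10, 60, 50)]
painted = paint_boxes(image, boxes)
recovered = decode_boxes(painted)
```

The round trip only works here because the painted colors are exact and each class appears once; output from a diffusion model would be noisy, so a practical decoder would likely need nearest-color matching and connected-component analysis.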
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Multimodal Machine Learning Applications
