RTGen: Real-Time Generative Detection Transformer
Chi Ruan, Jiying Zhao, Wenhu Chen

TL;DR
RTGen is a fast, non-autoregressive generative object detector that learns from detection labels to generate category names directly, achieving real-time performance without depending on an external language model.
Contribution
The paper presents RTGen, a novel real-time generative detection transformer with a unified encoder-decoder architecture and a DAG-structured decoder for efficient, non-autoregressive category name generation.
Findings
RTGen-R34 achieves 131.3 FPS on T4 GPUs.
RTGen is over 270x faster than GenerateU.
Models learn to generate category names without external supervision.
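To put the throughput figure in per-image terms, the short calculation below converts the reported 131.3 FPS into latency. The baseline latency is only illustrative: it assumes the "over 270x" ratio applies to the RTGen-R34 variant measured under the same conditions, which this summary does not state.

```python
def fps_to_latency_ms(fps: float) -> float:
    """Convert frames per second into milliseconds per frame."""
    return 1000.0 / fps


# Reported in the findings above.
rtgen_r34_fps = 131.3
rtgen_latency_ms = fps_to_latency_ms(rtgen_r34_fps)

# Assumption: the 270x speedup is measured against this same variant and setup.
assumed_speedup = 270.0
implied_baseline_latency_ms = rtgen_latency_ms * assumed_speedup

print(f"RTGen-R34: {rtgen_latency_ms:.2f} ms per image")
print(f"Implied baseline (assumed): {implied_baseline_latency_ms:.0f} ms per image")
```

Under that assumption, a detector 270x slower would need roughly two seconds per image, which is why the autoregressive design is described as impractical for real-time use.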
Abstract
Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting…
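The idea of non-autoregressive naming over a DAG can be illustrated with a minimal sketch. This is not RTGen's RL-Decoder: it is a generic greedy walk over a directed acyclic graph of token positions, in the style of DAG-based non-autoregressive decoders. The function name, the score layout, and the toy vocabulary are all hypothetical. The point is that every vertex's token scores and edge scores come from a single forward pass, so reading out a name is a cheap path lookup rather than repeated decoder calls.

```python
from typing import List, Sequence


def greedy_dag_decode(
    token_scores: Sequence[Sequence[float]],
    transition_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    eos: str = "<eos>",
) -> List[str]:
    """Greedily follow the best path through a DAG of vertices.

    token_scores[i][k] is the score of vocab[k] at vertex i.
    transition_scores[i][j] is the score of edge i -> j; only j > i is
    considered, which keeps the graph acyclic.
    """
    last = len(token_scores) - 1
    tokens: List[str] = []
    i = 0  # hypothetical convention: decoding always starts at vertex 0
    while True:
        row = token_scores[i]
        best_token = max(range(len(row)), key=row.__getitem__)
        if vocab[best_token] == eos:
            break
        tokens.append(vocab[best_token])
        if i == last:
            break
        # Jump to the best later vertex; skipped vertices emit nothing.
        i = max(range(i + 1, last + 1), key=transition_scores[i].__getitem__)
    return tokens


# Toy scores, invented for illustration only.
vocab = ["<eos>", "traffic", "light", "dog"]
token_scores = [
    [0.10, 0.70, 0.10, 0.10],  # vertex 0 prefers "traffic"
    [0.10, 0.10, 0.20, 0.60],  # vertex 1 prefers "dog" (skipped below)
    [0.10, 0.10, 0.70, 0.10],  # vertex 2 prefers "light"
    [0.90, 0.05, 0.03, 0.02],  # vertex 3 prefers "<eos>"
]
transition_scores = [
    [0.0, 0.2, 0.7, 0.1],  # from vertex 0, best edge goes to vertex 2
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],  # from vertex 2, only vertex 3 remains
    [0.0, 0.0, 0.0, 0.0],
]

name = greedy_dag_decode(token_scores, transition_scores, vocab)
print(" ".join(name))
```

In this toy graph the path visits vertices 0, 2, and 3, so vertex 1 contributes nothing and the decoded name has two tokens. Variable-length names fall out of which vertices the path skips, without generating tokens one step at a time.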
