RTGen: Real-Time Generative Detection Transformer

Chi Ruan; Jiying Zhao; Wenhu Chen

arXiv:2502.20622·cs.CV·November 18, 2025

TL;DR

RTGen introduces a fast, non-autoregressive generative object detector that directly generates category names from detection labels, achieving real-time performance without external language model dependencies.

Contribution

The paper presents RTGen, a novel real-time generative detection transformer with a unified encoder-decoder architecture and a DAG-structured decoder for efficient, non-autoregressive category name generation.
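To make the idea of DAG-structured, non-autoregressive name generation concrete, here is a minimal sketch of greedy decoding over a token DAG. It is not RTGen's implementation: the summary above does not specify the decoder's internals, so the function `greedy_dag_decode`, the score tables, the toy vocabulary, and the `<eos>` convention are all hypothetical. The sketch assumes that every vertex's token scores and every forward-edge transition score come from a single parallel pass, so decoding reduces to following the best path through the graph rather than feeding each token back into a language model.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_dag_decode(
    token_scores: Sequence[Sequence[float]],
    transition_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    end_token: str = "<eos>",
) -> List[str]:
    """Follow the highest-scoring path through a token DAG (illustrative only).

    token_scores[i][k]      : score of vocab[k] at vertex i
    transition_scores[i][j] : score of the edge i -> j, only read for j > i
    """
    tokens: List[str] = []
    vertex = 0
    last = len(token_scores) - 1
    while True:
        token = vocab[argmax(token_scores[vertex])]
        if token == end_token:
            break
        tokens.append(token)
        if vertex == last:
            break
        # Only edges to later vertices are considered, which keeps the graph acyclic.
        successors = range(vertex + 1, last + 1)
        vertex = max(successors, key=lambda j: transition_scores[vertex][j])
    return tokens


# Toy inputs (hypothetical): four vertices over a four-word vocabulary.
VOCAB = ["traffic", "light", "dog", "<eos>"]
TOKEN_SCORES = [
    [0.9, 0.0, 0.1, 0.0],  # vertex 0 prefers "traffic"
    [0.1, 0.2, 0.7, 0.0],  # vertex 1 prefers "dog" (skipped by the best path)
    [0.0, 0.8, 0.1, 0.1],  # vertex 2 prefers "light"
    [0.0, 0.0, 0.0, 1.0],  # vertex 3 prefers "<eos>"
]
TRANSITION_SCORES = [
    [0.0, 0.1, 0.8, 0.1],  # 0 -> 2 is the strongest edge
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],  # 2 -> 3
    [0.0, 0.0, 0.0, 0.0],
]

print(" ".join(greedy_dag_decode(TOKEN_SCORES, TRANSITION_SCORES, VOCAB)))
```

With these toy scores the path visits vertices 0, 2, and 3, skipping vertex 1. The point of the design is that path selection is a cheap lookup over scores computed once, whereas an autoregressive decoder needs one model call per generated token.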

Findings

01. RTGen-R34 achieves 131.3 FPS on T4 GPUs.

02. RTGen is over 270x faster than GenerateU.

03. Models learn to generate category names without external supervision.

Abstract

Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting…
