Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection
Mehmet Kerem Turkcan

TL;DR
DART turns SAM3 into a real-time multi-class detector by sharing backbone computation across classes. It needs no retraining, runs up to 25x faster than single-prompt processing at 80 classes, and surpasses some purpose-built open-vocabulary detectors on COCO.
Contribution
DART introduces a training-free method for converting promptable segmentation models into efficient multi-class detectors. It exploits the fact that the visual backbone is class-agnostic, so image features can be computed once and reused for every class.
Findings
Achieves 55.8 AP at 15.8 FPS on COCO with 80 classes.
Provides up to 25x speedup at 80 classes compared to single-prompt processing.
Outperforms some purpose-built open-vocabulary detectors without additional training.
Abstract
Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class…
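The backbone-sharing idea in the abstract can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not SAM3 or DART code: `CountingBackbone`, `prompt_head`, `detect_per_prompt`, `detect_shared`, and `estimated_speedup` are names invented for illustration, and the cost model assumes a fixed per-image backbone cost plus a fixed per-prompt head cost, ignoring batching and any other optimizations.

```python
from typing import Dict, List, Sequence


class CountingBackbone:
    """Toy class-agnostic backbone (hypothetical) that counts its own runs."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: Sequence[float]) -> List[float]:
        self.calls += 1
        # Features depend on the image only, never on the text prompt.
        return [2.0 * px for px in image]


def prompt_head(features: Sequence[float], prompt: str) -> float:
    """Toy prompt-conditioned head: one score per (features, prompt) pair."""
    return sum(features) * len(prompt)


def detect_per_prompt(backbone, image, prompts) -> Dict[str, float]:
    """Baseline: a full forward pass per prompt, so N backbone runs."""
    return {p: prompt_head(backbone(image), p) for p in prompts}


def detect_shared(backbone, image, prompts) -> Dict[str, float]:
    """Shared variant: one backbone run, features reused for all prompts."""
    features = backbone(image)
    return {p: prompt_head(features, p) for p in prompts}


def estimated_speedup(n: int, backbone_cost: float, head_cost: float) -> float:
    """Ratio of N*(B + H) to B + N*H under the simplified cost model."""
    naive = n * (backbone_cost + head_cost)
    shared = backbone_cost + n * head_cost
    return naive / shared


image = [0.1, 0.2, 0.3]
prompts = ["person", "dog", "bicycle"]

naive_bb, shared_bb = CountingBackbone(), CountingBackbone()
naive_out = detect_per_prompt(naive_bb, image, prompts)
shared_out = detect_shared(shared_bb, image, prompts)
```

Both paths return the same scores because the features never depend on the prompt. Only the number of backbone runs differs: three in the baseline, one in the shared variant. Under this simplified model the speedup is always below N and grows as the backbone dominates the per-prompt cost, which is consistent in spirit with the reported gain being largest at 80 classes.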
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
