Detect Everything with Few Examples
Xinyu Zhang, Yuhan Liu, Yuting Wang, Abdeslam Boularias

TL;DR
DE-ViT is a novel few-shot object detection method that eliminates the need for finetuning, using a region-propagation mechanism and prototype-based feature projection, achieving state-of-the-art results on multiple benchmarks.
Contribution
Introduces DE-ViT, a finetuning-free few-shot object detector with a new region-propagation architecture and robust prototype projection, advancing the state-of-the-art in few-shot detection.
Findings
Sets new state-of-the-art on Pascal VOC, COCO, and LVIS benchmarks.
Surpasses previous few-shot methods by large margins, e.g., 15 mAP on COCO 10-shot.
Successfully applied to real robot pick-and-place tasks.
Abstract
Few-shot object detection aims at detecting novel categories given only a few example images. It is a basic skill for a robot to perform tasks in open environments. Recent methods focus on finetuning strategies, with complicated procedures that prohibit a wider application. In this paper, we introduce DE-ViT, a few-shot object detector without the need for finetuning. DE-ViT's novel architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all…
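The prototype projection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a class prototype is the mean of that class's support features and that projection means taking the cosine similarity between a feature and each prototype. The function names (`build_prototypes`, `project_onto_prototypes`) and the toy 3-dimensional vectors are hypothetical stand-ins for real ViT features.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_prototypes(support_features):
    """Average each class's support features into one unit-length prototype.

    Assumption: mean pooling is used here for illustration only.
    """
    prototypes = {}
    for label, feats in support_features.items():
        dim = len(feats[0])
        mean = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def project_onto_prototypes(feature, prototypes):
    """Map a feature to its cosine similarity with every prototype.

    The output has one value per class, so its size depends on the
    number of prototypes rather than on the feature dimension.
    """
    f = l2_normalize(feature)
    return {
        label: sum(a * b for a, b in zip(f, p))
        for label, p in prototypes.items()
    }


# Toy support set: two example features per class (hypothetical data).
support = {
    "cup": [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]],
    "ball": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]],
}
prototypes = build_prototypes(support)

query = [0.8, 0.2, 0.0]
scores = project_onto_prototypes(query, prototypes)
best = max(scores, key=scores.get)
print(scores, best)
```

In this sketch, adding a new category only requires adding its support features to `support`. Nothing is trained, which is the sense in which a prototype-based approach can avoid finetuning.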
Peer Reviews
Decision: CoRL 2024
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques
Methods: Focus · Balanced Selection
