GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning
Jiayi Lin, Shaogang Gong

TL;DR
GridCLIP introduces a one-stage object detection method that leverages grid-level CLIP representations, achieving near two-stage detector performance with significantly faster training and inference, especially improving detection of infrequent categories.
Contribution
It proposes a novel grid-level alignment approach for one-stage detection using CLIP, narrowing the performance gap with two-stage detectors while reducing computational costs.
Findings
Narrows the performance gap to two-stage detectors on the LVIS benchmark.
Trains approximately 43 times faster and runs inference approximately 5 times faster than its two-stage counterpart (ViLD).
Improves detection of undersampled and novel categories.
Abstract
A vision-language foundation model pretrained on very large-scale image-text paired data has the potential to provide generalizable knowledge representation for downstream visual recognition and detection tasks, especially in supplementing the undersampled categories in downstream model training. Recent studies utilizing CLIP for object detection have shown that a two-stage detector design typically outperforms a one-stage detector, while requiring more expensive training resources and longer inference time. In this work, we propose a one-stage detector, GridCLIP, that narrows its performance gap to two-stage detectors while being approximately 43 and 5 times faster than its two-stage counterpart (ViLD) in training and testing, respectively. GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning by expanding the…
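To make the idea of grid-level classification concrete, the following is a minimal sketch and not the authors' implementation. It assumes each grid cell of a one-stage detector produces a feature vector in the same space as the category text embeddings, and it scores cells by cosine similarity followed by a softmax. The function name `grid_class_scores`, the `temperature` value, and the random arrays standing in for CLIP features are all hypothetical.

```python
import numpy as np

def grid_class_scores(grid_feats, text_embeds, temperature=0.01):
    """Score every grid cell against every category text embedding.

    grid_feats:  (H, W, D) per-cell image features (hypothetical detector output)
    text_embeds: (C, D) one embedding per category name (stand-in for CLIP text features)
    Returns an (H, W, C) array of per-cell softmax probabilities.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    g = grid_feats / np.linalg.norm(grid_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)

    # Cosine similarity of each cell with each category, scaled by an
    # assumed temperature.
    logits = (g @ t.T) / temperature          # shape (H, W, C)

    # Numerically stable softmax over the category axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Random placeholders: a 4x4 grid, 8-dimensional features, 3 categories.
rng = np.random.default_rng(0)
grid_feats = rng.normal(size=(4, 4, 8))
text_embeds = rng.normal(size=(3, 8))

probs = grid_class_scores(grid_feats, text_embeds)
labels = probs.argmax(axis=-1)                # predicted category index per cell
```

Because the categories are represented only by their text embeddings, swapping in embeddings for a different set of category names changes the label space without altering the scoring function, which is one reason this style of classifier is often used for undersampled categories.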
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Test · Contrastive Language-Image Pre-training
