F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Weicheng Kuo; Yin Cui; Xiuye Gu; AJ Piergiovanni; Anelia Angelova

arXiv:2209.15639 · cs.CV · February 27, 2023 · 37 citations

TL;DR

F-VLM is a simple open-vocabulary object detection method built on frozen vision and language models. Only the detector head is finetuned, and the paper reports state-of-the-art results on LVIS novel categories together with significant training speed-up and compute savings.

Contribution

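The paper proposes F-VLM, which builds an open-vocabulary detector on top of a frozen vision-language model (VLM). It removes the knowledge distillation and detection-tailored pretraining stages of earlier multi-stage pipelines, finetunes only the detector head, and combines the detector and VLM outputs for each region at inference time.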

Findings

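01  +6.5 mask AP over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark

02  Very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection

03  Significant training speed-up and compute savings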

Abstract

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves +6.5 mask AP improvement over the previous state of the art on novel categories of LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on COCO open-vocabulary detection benchmark and cross-dataset transfer detection, in addition to significant training speed-up and compute savings.…
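
The abstract says the detector and VLM outputs are combined for each region at inference time, but the excerpt here does not state the combination rule. The sketch below is only an illustration of one plausible rule, a weighted geometric mean of the two per-category probabilities, with a separate weight for base and novel categories. The function names, the `alpha` and `beta` weights, and all numeric values are assumptions made for this example and are not taken from the text above.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_scores(detector_probs, vlm_probs, is_base, alpha=0.35, beta=0.65):
    """Weighted geometric mean of detector and VLM scores for one region.

    alpha weights the VLM score for base (trained) categories and beta for
    novel categories. Both defaults are illustrative placeholders.
    """
    combined = []
    for z, w, base in zip(detector_probs, vlm_probs, is_base):
        weight = alpha if base else beta
        combined.append((z ** (1.0 - weight)) * (w ** weight))
    return combined


# Hypothetical region with three candidate categories: one base, two novel.
detector_probs = softmax([2.0, 0.5, 0.1])  # from the finetuned detector head
vlm_probs = softmax([0.3, 1.8, 0.2])       # from the frozen VLM
is_base = [True, False, False]

scores = combine_scores(detector_probs, vlm_probs, is_base)
best = max(range(len(scores)), key=scores.__getitem__)
print("combined scores:", scores, "predicted category index:", best)
```

Setting a weight to 0 falls back to the detector score alone, and setting it to 1 uses only the VLM score, which makes it easy to see how much each source contributes for a given category group.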

Code & Models

Repositories

google-research/google-research/tree/master/fvlm (JAX, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Knowledge Distillation