Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
Frank Ruis, Gertjan Burghouts, Hugo Kuijf

TL;DR
This paper introduces Textual Inversion for open-vocabulary object detection, enabling efficient adaptation to new objects with minimal examples while preserving the original model's zero-shot capabilities.
Contribution
It proposes extending the vocabulary of a vision language model (VLM) by learning new tokens, which avoids fine-tuning the model and retains its zero-shot performance.
Findings
Outperforms baseline methods that suffer from forgetting.
Requires significantly less compute than full fine-tuning.
Maintains original zero-shot capabilities after adaptation.
Abstract
Large pre-trained vision language models (VLMs) have reached state-of-the-art performance on several object detection benchmarks and boast strong zero-shot capabilities, but for optimal performance on specific targets some form of fine-tuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new tokens or improving existing ones to accurately detect novel or fine-grained objects from as few as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen,…
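The core idea in the abstract, learning a new token while the pre-trained weights stay frozen, can be illustrated with a toy sketch. This is not the authors' implementation. The linear map `W` is a hypothetical stand-in for the frozen VLM text encoder, the three feature vectors are made-up few-shot examples, and a plain squared-error objective is assumed in place of whatever detection loss the paper uses. All function names (`encode`, `loss_and_grad`, `learn_token`) are invented for illustration.

```python
import copy
import random


def encode(W, embedding):
    """Toy stand-in for a frozen text encoder: a fixed linear map W @ embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in W]


def loss_and_grad(W, v, examples):
    """Mean squared error between the encoded token and the example features.

    Returns the loss and its gradient with respect to v only; W is never updated.
    """
    out = encode(W, v)
    grad = [0.0] * len(v)
    total = 0.0
    for feat in examples:
        residual = [o - f for o, f in zip(out, feat)]
        total += sum(r * r for r in residual)
        # d/dv ||W v - f||^2 = 2 W^T (W v - f)
        for j in range(len(v)):
            grad[j] += 2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
    n = len(examples)
    return total / n, [g / n for g in grad]


def learn_token(W, vocab, name, examples, steps=200, lr=0.05, seed=0):
    """Learn one new token embedding by gradient descent; return an extended vocabulary."""
    rng = random.Random(seed)
    v = [rng.uniform(-0.1, 0.1) for _ in range(len(W[0]))]
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, v, examples)
        history.append(loss)
        v = [x - lr * g for x, g in zip(v, grad)]
    extended = dict(vocab)  # existing tokens are copied over untouched
    extended[name] = v
    return extended, history


# Hypothetical frozen components (illustrative values only).
W = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.3], [0.1, 0.0, 1.0]]
vocab = {"cat": [0.5, -0.2, 0.1], "dog": [0.4, 0.3, -0.1]}
W_before, vocab_before = copy.deepcopy(W), copy.deepcopy(vocab)

# Three made-up feature vectors standing in for few-shot examples of a new object.
examples = [[0.9, 0.0, 0.4], [1.0, 0.1, 0.5], [1.1, 0.2, 0.6]]

extended, history = learn_token(W, vocab, "<new-object>", examples)
```

Because only the new embedding receives gradient updates, `W` and the existing entries of `vocab` are identical before and after training. In this toy setting that mirrors the property the abstract highlights: the original model is left intact, so queries that use the original vocabulary behave exactly as before.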
