Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park, Kuniaki Saito, Donghyun Kim

TL;DR
This paper introduces a novel synthetic data generation method leveraging generative models to improve the compositional understanding of vision-language models in language-based object detection, significantly boosting performance on benchmark datasets.
Contribution
The paper proposes a structured synthetic data generation framework and a new contrastive learning formulation that together strengthen VL models' compositional understanding, turning weaker models into stronger ones.
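To make the idea of contrastive learning over positive and negative descriptions concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this summary does not spell out. It assumes a generic InfoNCE-style loss over cosine similarities, and the names `Triplet`, `cosine`, `contrastive_loss`, and the `temperature` default are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Triplet:
    """Hypothetical container for one synthetic sample (image, text, boxes)."""
    image_id: str
    description: str
    boxes: List[Tuple[float, float, float, float]]  # (x1, y1, x2, y2)
    is_positive: bool  # True if the description matches the boxed object


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    region: Sequence[float],
    positive_text: Sequence[float],
    negative_texts: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Generic InfoNCE-style loss: negative log-probability of the positive
    description among the positive and all negative descriptions."""
    logits = [cosine(region, positive_text) / temperature]
    logits += [cosine(region, neg) / temperature for neg in negative_texts]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Toy embeddings: the region aligns with the first text, not the second.
region_embedding = [1.0, 0.0]
matching_text = [1.0, 0.0]
mismatched_text = [0.0, 1.0]

loss_aligned = contrastive_loss(region_embedding, matching_text, [mismatched_text])
loss_misaligned = contrastive_loss(region_embedding, mismatched_text, [matching_text])
```

In this toy setup the loss is small when the region embedding sits closer to the positive description than to the hard negative, and large when the two are swapped. That is the training signal a contrastive objective of this kind provides.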
Findings
Up to +5 AP improvement on the OmniLabel benchmark
Up to +6.9 AP improvement on the D3 benchmark
Effective enhancement of VL models' understanding of complex language descriptions
Abstract
Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Contrastive Learning
