Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

Kwanyong Park; Kuniaki Saito; Donghyun Kim

arXiv:2407.15296 · cs.CV · July 23, 2024

Open Access

TL;DR

This paper introduces a structured synthetic data generation method that uses generative models to improve the compositional understanding of vision-language (VL) models in language-based object detection, with reported gains of up to +5 AP on the OmniLabel benchmark and +6.9 AP on the D3 benchmark.

Contribution

The paper proposes a structured synthetic data generation framework and a new contrastive learning formulation that together improve VL models' compositional understanding, turning 'weaker' models into 'stronger' ones.

Findings

01 Up to +5 AP improvement on the OmniLabel benchmark

02 Up to +6.9 AP improvement on the D3 benchmark

03 Improved understanding of complex language descriptions by VL models

Abstract

Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a…
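The abstract describes learning from paired positive and negative descriptions of the same object. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss that scores one region embedding against one positive text embedding and several negative text embeddings. It is not the paper's formulation, which is not given on this page; the function names, the toy 2-D embeddings, and the temperature value are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(region, pos_text, neg_texts, temperature=0.1):
    """Generic InfoNCE-style loss (an illustrative assumption, not the paper's loss).

    The loss is small when the region embedding is closer to the positive
    description than to every negative description.
    """
    # Index 0 holds the positive logit; the rest are negatives.
    logits = [cosine(region, pos_text) / temperature]
    logits += [cosine(region, t) / temperature for t in neg_texts]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive description.
    return log_denominator - logits[0]

# Toy embeddings (hypothetical): the region matches the first description.
region = [1.0, 0.0]
matching_text = [1.0, 0.0]     # stands in for a correct description
mismatched_text = [0.0, 1.0]   # stands in for a hard-negative description

loss_aligned = contrastive_loss(region, matching_text, [mismatched_text])
loss_swapped = contrastive_loss(region, mismatched_text, [matching_text])
```

In this toy setup `loss_aligned` is close to zero and `loss_swapped` is large, which is the behavior a contrastive objective over positive and negative descriptions is meant to reward.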

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Contrastive Learning