Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

Ugur Sahin; Hang Li; Qadeer Khan; Daniel Cremers; Volker Tresp

arXiv:2311.03964·cs.CV·November 8, 2023·1 cites

TL;DR

This paper introduces a novel framework that generates challenging negative samples in both image and text modalities to improve the compositional reasoning abilities of visual language models, addressing limitations of traditional contrastive training.

Contribution

It proposes a dual-direction negative sample mining and generation approach, significantly enhancing VLMs' performance on complex multimodal reasoning tasks.

Findings

01 Improved accuracy on compositional reasoning benchmarks
02 Enhanced discrimination of complex image-text interactions
03 Effective negative sample generation in both modalities

Abstract

Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks, which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) But existing generative approaches primarily focus on generating hard negative…
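
To make the role of hard negatives in contrastive training concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss in which each image is scored against one extra hard negative caption and each caption against one extra hard negative image. This is a generic illustration, not the loss used in the paper: the function names, the one-hard-negative-per-sample layout, and the temperature value are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def contrastive_loss_with_hard_negatives(
    img, txt, hard_neg_txt, hard_neg_img, temperature=0.07
):
    """Hypothetical helper (not from the paper).

    img, txt:      (B, D) embeddings of matched image-caption pairs.
    hard_neg_txt:  (B, D) one hard negative caption per image (assumed layout).
    hard_neg_img:  (B, D) one hard negative image per caption (assumed layout).
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    hard_neg_txt = l2_normalize(hard_neg_txt)
    hard_neg_img = l2_normalize(hard_neg_img)
    batch = img.shape[0]
    idx = np.arange(batch)

    # Image -> text: B in-batch captions plus each image's own hard negative.
    i2t_hard = np.sum(img * hard_neg_txt, axis=-1, keepdims=True)
    logits_i2t = np.concatenate([img @ txt.T, i2t_hard], axis=1) / temperature

    # Text -> image: B in-batch images plus each caption's own hard negative.
    t2i_hard = np.sum(txt * hard_neg_img, axis=-1, keepdims=True)
    logits_t2i = np.concatenate([txt @ img.T, t2i_hard], axis=1) / temperature

    # The matching pair sits on the diagonal of the first B columns.
    loss_i2t = -log_softmax(logits_i2t)[idx, idx].mean()
    loss_t2i = -log_softmax(logits_t2i)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, a negative that lies close to the positive in embedding space contributes a large term to the softmax denominator and therefore a larger loss, while a negative that is already far away contributes almost nothing. That is the intuition behind the abstract's point that easily discriminated negatives provide little training signal.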

Code & Models

Datasets

ugursahin/generative-negative-mining-dataset
dataset · 9 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
