VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision   Language Models

Gokul Karthik Kumar; Iheb Chaabane; Kebin Wu

arXiv:2502.10250·cs.CL·February 25, 2025

VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models

Gokul Karthik Kumar, Iheb Chaabane, Kebin Wu

PDF

Open Access 2 Datasets

TL;DR

This paper introduces VisCon-100K, a large dataset derived from web documents, to improve vision-language models by leveraging web context and fine-tuning with generated image captions and question-answer pairs.

Contribution

We created VisCon-100K from web data, utilizing GPT-4V and OpenChat 3.5 to generate diverse captions and QA pairs, enhancing VLM performance with contextual web information.

Findings

01

Fine-tuning with VisCon-100K improves benchmark performance.

02

Leaky modality mix outperforms non-leaky approaches.

03

Dataset and tools facilitate scalable future research.

Abstract

Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved image-text web documents. Our approach transforms 45K web documents from the OBELICS dataset into 100K image conversation samples. We utilize GPT-4V to generate image-contextual captions and OpenChat 3.5 model to convert these captions into diverse free-form and multiple-choice question-answer pairs. Integrating this dataset for fine-tuning considerably enhances VLM performance across multiple benchmarks. Unlike methods that focus solely on fine-grained visual content, our approach leverages accompanying web context, yielding superior results. We also discover that a 'leaky modality mix', where conversation samples contain questions answerable from both…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSemantic Web and Ontologies · Multimodal Machine Learning Applications · Natural Language Processing Techniques

MethodsFocus