VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models
Gokul Karthik Kumar, Iheb Chaabane, Kebin Wu

TL;DR
This paper introduces VisCon-100K, a dataset of 100K image conversation samples derived from interleaved image-text web documents. It is designed for fine-tuning vision-language models on captions that draw on the surrounding web context and on question-answer pairs generated from those captions.
Contribution
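We built VisCon-100K from 45K web documents in the OBELICS dataset, using GPT-4V to generate image-contextual captions and OpenChat 3.5 to convert those captions into diverse free-form and multiple-choice QA pairs. Fine-tuning on the resulting samples improves VLM performance by exploiting contextual web information.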
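The two-stage generation described above (a context-aware caption first, then QA pairs derived from that caption) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `WebImage` and `ConversationSample` types, the `build_sample` function, and the stub generators are all hypothetical names, and the real pipeline would call GPT-4V and OpenChat 3.5 where the stubs stand in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


@dataclass
class WebImage:
    image_id: str  # hypothetical identifier for an image in a web document
    context: str   # web text that accompanies the image


@dataclass
class ConversationSample:
    image_id: str
    caption: str
    qa_pairs: List[QAPair]


def build_sample(
    image: WebImage,
    caption_fn: Callable[[WebImage], str],
    qa_fn: Callable[[str], List[QAPair]],
) -> ConversationSample:
    """Stage 1: caption the image using its web context.
    Stage 2: turn that caption into question-answer pairs."""
    caption = caption_fn(image)
    qa_pairs = qa_fn(caption)
    return ConversationSample(image.image_id, caption, qa_pairs)


# Stub generators (assumptions): placeholders for the captioning model
# and the text-only QA model, so the sketch runs without any API access.
def stub_caption(image: WebImage) -> str:
    return f"Image {image.image_id}, described using context: {image.context}"


def stub_qa(caption: str) -> List[QAPair]:
    return [("What does the image show?", caption)]


sample = build_sample(
    WebImage("img-001", "A recipe page about sourdough bread."),
    stub_caption,
    stub_qa,
)
print(sample.qa_pairs[0][0])
```

Passing the two generators as callables keeps the captioning and QA stages independent, so either model could be swapped without touching the sample format.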
Findings
Fine-tuning with VisCon-100K improves VLM performance across multiple benchmarks.
Leaky modality mix outperforms non-leaky approaches.
Dataset and tools facilitate scalable future research.
Abstract
Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved image-text web documents. Our approach transforms 45K web documents from the OBELICS dataset into 100K image conversation samples. We utilize GPT-4V to generate image-contextual captions and the OpenChat 3.5 model to convert these captions into diverse free-form and multiple-choice question-answer pairs. Integrating this dataset for fine-tuning considerably enhances VLM performance across multiple benchmarks. Unlike methods that focus solely on fine-grained visual content, our approach leverages accompanying web context, yielding superior results. We also discover that a 'leaky modality mix', where conversation samples contain questions answerable from both…
Taxonomy
Topics: Semantic Web and Ontologies · Multimodal Machine Learning Applications · Natural Language Processing Techniques
