20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

DatologyAI: Siddharth Joshi; Haoli Yin; Rishabh Adiga; Haakon Mongstad; Alvin Deng; Aldo Carranza; Alex Fang; Amro Abbas; Anshuman Suri; Brett Larsen; Daniel Zayas; Darren Teh; David Schwab; Diego Kiner; Fan Pan; Jack Urbanek; Jason Lee; Jason Telanoff; Josh Wills; Kaleigh Mentzer; Luke Merrick; Maximilian Böther; Parth Doshi; Paul Burstein; Pratyush Maini; Ties Robroek; Tony Jiang; Vidhi Jain; Vineeth Dorna; Zhengping Wang; Bogdan Gaza; Ari Morcos; Matthew Leavitt

arXiv:2605.11405 · cs.LG · May 14, 2026

TL;DR

This paper shows that data curation alone, with architecture, training recipe, and compute held fixed, substantially improves vision-language models (VLMs): a curated 2B model comes within 1.8pp of Qwen3-VL-2B at ~87x less training compute.

Contribution

It shows that careful data curation can enhance VLM performance, reliability, and efficiency without changing architecture or training recipes.

Findings

01

Curated models outperform baselines by +11.7pp on average across 20 public VLM benchmarks.

02

The curated 2B model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute.

03

Data curation improves out-of-distribution generalization and inference cost efficiency.

Abstract

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond…
