FineVision: Open Data Is All You Need

Luis Wiedmann; Orr Zohar; Amir Mahla; Xiaohan Wang; Rui Li; Thibaud Frere; Leandro von Werra; Aritra Roy Gosthipaty; Andrés Marafioti

arXiv:2510.17269·cs.CV·May 21, 2026

TL;DR

FineVision is a large, carefully curated open dataset of 24 million vision-language samples, designed to improve the training and evaluation of vision-language models through rigorous data collection and cleaning.

Contribution

The paper introduces FineVision, the largest unified, high-quality open dataset for vision-language models, with a semi-automated, human-in-the-loop curation process.

Findings

01. Models trained on FineVision outperform those trained on other open datasets.

02. FineVision's data hygiene and scale lead to better model performance.

03. The dataset and tools are publicly released to support future research.

Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to…
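The de-duplication and decontamination steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say how duplicates or benchmark overlaps are detected, so everything here is an assumption made for illustration: each sample is taken to have a precomputed embedding vector, near-duplicates are defined by a cosine-similarity threshold, and the names `cosine`, `deduplicate`, `decontaminate`, and the `0.95` threshold are hypothetical rather than the paper's method.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(samples, threshold=0.95):
    """Greedy near-duplicate removal (hypothetical helper).

    samples: list of (sample_id, embedding) pairs.
    A sample is kept only if it is below the threshold against every
    sample kept so far, so the first occurrence of a cluster survives.
    """
    kept = []
    for sample_id, emb in samples:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((sample_id, emb))
    return kept


def decontaminate(samples, benchmark_embeddings, threshold=0.95):
    """Split samples into (kept_ids, flagged_ids) (hypothetical helper).

    A sample is flagged when it is at or above the threshold against
    any benchmark embedding.
    """
    kept_ids, flagged_ids = [], []
    for sample_id, emb in samples:
        if any(cosine(emb, b) >= threshold for b in benchmark_embeddings):
            flagged_ids.append(sample_id)
        else:
            kept_ids.append(sample_id)
    return kept_ids, flagged_ids


# Toy data: "b" nearly duplicates "a"; "c" matches a benchmark item.
train = [("a", [1.0, 0.0]), ("b", [0.99, 0.01]), ("c", [0.0, 1.0])]
benchmarks = [[0.0, 1.0]]

unique = deduplicate(train)
kept_ids, flagged_ids = decontaminate(unique, benchmarks)
```

This pairwise comparison is quadratic in the number of samples, so at a scale of millions of samples an approximate nearest-neighbour index would likely be needed in place of the inner loops; the sketch only shows the thresholding logic.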

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Ethics and Social Impacts of AI