Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Rafi Al Attrach; Rajna Fani; Sebastian Lobentanzer; Joan Giner-Miguelez; Debanshu Das; Varuni H. K.; Nobin Sarwar; Rajat Ghosh; Anwai Archit; Surbhi Motghare; Christina Conrad Parry; Luis Oala; Lara Grosso; Joaquin Vanschoren; Steffen Vogler; Sujata Goswami; Eric S. Rosenthal; Marzyeh Ghassemi; Matthew McDermott; Tom Pollard

arXiv:2605.15079 · cs.LG · May 15, 2026

TL;DR

Croissant Baker is an open-source tool that generates validated Croissant metadata locally from datasets, enabling better discovery, governance, and reuse of ML datasets, especially for large or private repositories.

Contribution

The paper introduces Croissant Baker, a local-first command-line tool that automates Croissant metadata generation for diverse datasets, reaching 97-100% agreement with ground truth on several of them.

Findings

01

Achieved 97-100% agreement with ground truth on multiple datasets.

02

Successfully scaled to MIMIC-IV with 886 million rows and 374 Parquet files.

03

Evaluated on over 140 datasets, generating validated Croissant metadata for them.

Abstract

Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons…
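The abstract mentions a modular handler registry without spelling out how it works. The sketch below shows one plausible shape for that idea, not Croissant Baker's actual code: every name in it (`HANDLERS`, `register`, `csv_columns`, `describe`, `demo`) is hypothetical. Handlers are looked up by file suffix, each returns the column names of one file, and a directory walk assembles the results. The output keys mimic Croissant's vocabulary (`cr:FileObject`, `cr:RecordSet`, `cr:Field`), but the result is a simplified illustration and is not validated Croissant JSON-LD.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: file suffix -> function returning column names.
HANDLERS: Dict[str, Callable[[Path], List[str]]] = {}


def register(suffix: str):
    """Decorator that adds a handler for one file suffix to the registry."""
    def wrap(fn):
        HANDLERS[suffix] = fn
        return fn
    return wrap


@register(".csv")
def csv_columns(path: Path) -> List[str]:
    # Read only the header row, so large files are not loaded into memory.
    with path.open(newline="") as f:
        return next(csv.reader(f), [])


def describe(root: Path) -> dict:
    """Walk a dataset directory and build a Croissant-like description."""
    files, record_sets = [], []
    for path in sorted(root.rglob("*")):
        handler = HANDLERS.get(path.suffix.lower())
        if handler is None or not path.is_file():
            continue  # no handler registered for this file type
        rel = path.relative_to(root).as_posix()
        files.append({"@type": "cr:FileObject", "@id": rel, "contentUrl": rel})
        record_sets.append({
            "@type": "cr:RecordSet",
            "@id": path.stem,
            "field": [{"@type": "cr:Field", "name": c} for c in handler(path)],
        })
    return {"@type": "sc:Dataset", "name": root.name,
            "distribution": files, "recordSet": record_sets}


def demo() -> dict:
    """Build a throwaway dataset directory and describe it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "patients.csv").write_text("subject_id,age\n1,42\n")
        (root / "notes.bin").write_bytes(b"\x00")  # skipped: no handler
        return describe(root)


print(json.dumps(demo(), indent=2))
```

Under this design, supporting another format such as Parquet would mean registering one more function under its suffix, leaving the directory walk unchanged. Everything runs against local files, which fits the local-first motivation the abstract gives for governed and large repositories.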
