ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

William Brach; Francesco Zuppichini; Marco Vinciguerra; Lorenzo Padoan

arXiv:2602.15189·cs.IR·May 11, 2026

TL;DR

This paper introduces ScrapeGraphAI-100k, a real-world dataset of 93,695 schema-constrained extraction events for training and benchmarking language models on structured-output tasks.

Contribution

It provides a substantial, diverse dataset from real practitioner workloads, addressing limitations of synthetic and text-only datasets for schema-constrained generation.

Findings

01

A 1.7B fine-tuned model closely matches GPT-5-nano's output distribution.

02

The dataset covers 18,000+ unique schemas across 15 named languages plus a long-tail Other category; English and Traditional Chinese account for 88% of detected content.

03

Schema complexity impacts model performance, revealing sharp failure thresholds.

Abstract

Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2–Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw…
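To make "structural conformance" concrete, here is a minimal sketch of what such a label checks. It is not the dataset's labelling code: the abstract says the labels come from jsonschema-rs, which implements the full JSON Schema specification, whereas this stdlib-only illustration handles just `type` (assumed to be a single string), `required`, `properties`, and `items`. The function name `conformance_errors` and the example schema and records are hypothetical.

```python
import json

# JSON Schema type names mapped to Python types (assumption: "type" is a single string).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def conformance_errors(instance, schema, path="$"):
    """Return a list of structural violations; an empty list means conformant.

    Illustrative subset only: `type`, `required`, `properties`, `items`.
    """
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, _TYPES[expected])
        # bool is a subclass of int in Python, but JSON Schema keeps them distinct.
        if expected in ("number", "integer") and isinstance(instance, bool):
            ok = False
        if not ok:
            errors.append(f"{path}: expected {expected}")
            return errors
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required key '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(
                    conformance_errors(instance[key], subschema, f"{path}.{key}")
                )
    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            errors.extend(
                conformance_errors(item, schema["items"], f"{path}[{i}]")
            )
    return errors


# Hypothetical schema and two hypothetical LLM responses.
schema = {
    "type": "object",
    "required": ["title", "price"],
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
good = json.loads('{"title": "Widget", "price": 9.5, "tags": ["a", "b"]}')
bad = json.loads('{"title": "Widget", "tags": ["a", 3]}')

print(conformance_errors(good, schema))  # conformant: no violations
print(conformance_errors(bad, schema))   # missing "price", non-string tag
```

A check like this only says whether the response has the right shape. It says nothing about whether the extracted values are correct for the page, which matches the abstract's note that semantic correctness is out of scope.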

Code & Models

Models

🤗 sukritvemula/WebScrapeAgent-7B-v1 (model)
