TL;DR
This paper introduces ScrapeGraphAI-100k, a dataset of 93,695 real-world schema-constrained extraction events for training and benchmarking language models on structured-output tasks.
Contribution
The dataset is drawn from real practitioner workloads and pairs page content with the prompts and schemas used in practice, addressing the limits of small, synthetic, or text-only datasets for schema-constrained generation.
Findings
A 1.7B fine-tuned model closely matches GPT-5-nano's output distribution.
The dataset covers 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese accounting for 88% of detected content.
Schema complexity impacts model performance, revealing sharp failure thresholds.
Abstract
Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2-Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw…
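To make "structural conformance" concrete, the following is a minimal sketch of what such a label checks. It is not the jsonschema-rs validator the abstract names: it is a simplified pure-Python stand-in that handles only the `type`, `required`, `properties`, and `items` keywords, and the function name `conformance_errors` and the example schema are hypothetical. As in the abstract, it says nothing about whether the extracted values are semantically correct.

```python
import json
from typing import Any

# Subset of JSON Schema type names mapped to Python types (assumption: no
# other keywords such as enum, pattern, or additionalProperties are checked).
_JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def _matches_type(instance: Any, name: str) -> bool:
    # bool is a subclass of int in Python, so reject it for numeric types.
    if isinstance(instance, bool) and name in ("number", "integer"):
        return False
    return isinstance(instance, _JSON_TYPES[name])


def conformance_errors(instance: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of structural violations; an empty list means conformant."""
    expected = schema.get("type")
    if expected is not None:
        names = [expected] if isinstance(expected, str) else list(expected)
        if not any(_matches_type(instance, n) for n in names):
            return [f"{path}: expected {' or '.join(names)}"]

    errors: list[str] = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required property '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(conformance_errors(instance[key], sub, f"{path}.{key}"))
    if isinstance(instance, list) and isinstance(schema.get("items"), dict):
        for i, item in enumerate(instance):
            errors.extend(conformance_errors(item, schema["items"], f"{path}[{i}]"))
    return errors


# Hypothetical schema and two hypothetical LLM responses.
schema = {
    "type": "object",
    "required": ["title", "price"],
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
good = json.loads('{"title": "Widget", "price": 9.5, "tags": ["a", "b"]}')
bad = json.loads('{"title": "Widget", "tags": ["a", 3]}')

print(conformance_errors(good, schema))  # []
print(conformance_errors(bad, schema))
```

A per-example label can then be derived as `len(conformance_errors(response, schema)) == 0`. A full validator would cover the complete JSON Schema vocabulary, which this sketch deliberately does not.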
