The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, and Vineet Agarwal

TL;DR
The paper introduces SOB, a multi-source benchmark for evaluating structured output quality in large language models across text, images, and audio. It shows that current models reach high schema compliance but limited value accuracy.
Contribution
It presents SOB, a novel multi-source benchmark with diverse data types and evaluation metrics, enabling fair comparison of models' structured output capabilities beyond schema adherence.
Findings
Models achieve near-perfect schema compliance.
Value accuracy peaks at 83.0% on text, 67.2% on images, and 23.7% on audio.
Longer contexts significantly reduce extraction accuracy.
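The gap between the first two findings is easier to see with a toy example. The sketch below is illustrative only and is not the benchmark's scoring code: the helper names `schema_compliant` and `value_accuracy`, the flat field-to-type schema, and the invoice record are all assumptions made for this example, and exact-match scoring is a simplification.

```python
from typing import Any

# Hypothetical, simplified schema: field name -> expected Python type.
Schema = dict[str, type]


def schema_compliant(output: dict[str, Any], schema: Schema) -> bool:
    """True if the output has exactly the schema's fields, each of the expected type."""
    if set(output) != set(schema):
        return False
    return all(isinstance(output[name], expected) for name, expected in schema.items())


def value_accuracy(output: dict[str, Any], gold: dict[str, Any]) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for name, value in gold.items() if output.get(name) == value)
    return correct / len(gold)


# Made-up invoice record: the prediction has the right shape but wrong values.
schema = {"vendor": str, "total": float, "currency": str}
gold = {"vendor": "Acme Corp", "total": 1240.50, "currency": "USD"}
predicted = {"vendor": "Acme Corp", "total": 1204.50, "currency": "EUR"}

compliant = schema_compliant(predicted, schema)  # shape and types are all valid
accuracy = value_accuracy(predicted, gold)       # only "vendor" matches the gold record
```

Here the prediction passes the schema check while only one of three values is correct, which is the kind of failure a compliance-only metric cannot detect.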
Abstract
Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices and medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full…
