The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

Abhinav Kumar Singh; Harsha Vardhan Khurdula; Yoeven D Khemlani; and Vineet Agarwal

arXiv:2604.25359·cs.CL·April 29, 2026

TL;DR

The paper introduces SOB, a multi-source benchmark for evaluating structured output quality in large language models across text, image, and audio sources. It finds that current models achieve high schema compliance but limited value accuracy.

Contribution

It presents SOB, a novel multi-source benchmark with diverse data types and evaluation metrics, enabling fair comparison of models' structured output capabilities beyond schema adherence.

Findings

01 Models achieve near-perfect schema compliance.

02 Value accuracy peaks at 83.0% on text, 67.2% on images, and 23.7% on audio.

03 Longer contexts significantly reduce extraction accuracy.
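
The gap between the first two findings is easier to see with a toy example. The sketch below is illustrative only: the schema, the field names, and the exact-match scoring rule are assumptions made here and are not the paper's metric definitions. It shows how an output can pass a schema check while still carrying a wrong value.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA: dict[str, type] = {"invoice_id": str, "total": float, "currency": str}


def is_schema_compliant(record: dict[str, Any], schema: dict[str, type]) -> bool:
    """True when the record has exactly the schema's fields with matching types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def value_accuracy(pred: dict[str, Any], gold: dict[str, Any]) -> float:
    """Fraction of gold fields whose predicted value matches exactly.

    Exact match is an assumed scoring rule, not the benchmark's own.
    """
    if not gold:
        return 1.0
    hits = sum(1 for name, value in gold.items() if pred.get(name) == value)
    return hits / len(gold)


gold = {"invoice_id": "INV-001", "total": 120.5, "currency": "USD"}
# Well-formed output with a transposed digit in "total".
pred = {"invoice_id": "INV-001", "total": 102.5, "currency": "USD"}

compliant = is_schema_compliant(pred, SCHEMA)
accuracy = value_accuracy(pred, gold)
print(compliant, round(accuracy, 3))  # True 0.667
```

Here the prediction is fully schema-compliant, yet only two of its three values are correct, which is the kind of error a compliance-only benchmark would miss.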

Abstract

Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices and medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full…
