LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner

TL;DR
LLMStructBench is a new benchmark for evaluating large language models on extracting structured data from text and generating valid JSON, highlighting the importance of prompting strategies over model size.
Contribution
It introduces a comprehensive benchmark and dataset for assessing LLMs on structured data extraction, along with new metrics and insights into prompting strategies.
Findings
Prompting strategy selection impacts parsing reliability more than model size.
Smaller models can achieve structural validity with proper prompts.
Trade-offs exist between validity and semantic accuracy.
Abstract
We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. The right strategy ensures structural validity, especially for smaller or less reliable models, but increases the number of semantic errors. Our benchmark suite is a step towards…
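The distinction the abstract draws between document-level validity and finer-grained accuracy can be illustrated with a minimal sketch. This is not the benchmark's own metric code: the function names `is_valid_json` and `field_accuracy`, the exact-match comparison of top-level fields (used here as a simple stand-in for token-level accuracy), and the sample records are all assumptions made for illustration.

```python
import json


def is_valid_json(output: str) -> bool:
    """Document-level check: does the whole output parse as a JSON object?"""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def field_accuracy(output: str, expected: dict) -> float:
    """Fraction of expected top-level fields reproduced exactly.

    Hypothetical stand-in for a finer-grained accuracy metric;
    outputs that fail to parse score 0.0.
    """
    if not is_valid_json(output):
        return 0.0
    if not expected:
        return 1.0
    predicted = json.loads(output)
    hits = sum(1 for k, v in expected.items()
               if k in predicted and predicted[k] == v)
    return hits / len(expected)


# Illustrative data only, not taken from the LLMStructBench dataset.
expected = {"name": "Ada Lovelace", "year": 1843}
outputs = [
    '{"name": "Ada Lovelace", "year": 1843}',  # valid and correct
    '{"name": "Ada Lovelace", "year": 1834}',  # valid but semantically wrong
    '{"name": "Ada Lovelace", "year": 1843',   # invalid: missing closing brace
]

validity = [is_valid_json(o) for o in outputs]
accuracy = [field_accuracy(o, expected) for o in outputs]
print(validity, accuracy)
```

The second sample shows why the two measurements are kept separate: an output can parse cleanly and still carry a wrong value, which is the kind of validity versus semantic accuracy trade-off the findings describe.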
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing
