Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction
Hui Wen Goh, Jonas Mueller

TL;DR
CONSTRUCT is a real-time trustworthiness scoring system for LLM structured outputs that identifies errors and helps prioritize human review without requiring labeled data or model modifications.
Contribution
It introduces a novel, model-agnostic uncertainty estimator for structured outputs, supporting complex schemas and providing detailed trust scores for each output field.
Findings
CONSTRUCT outperforms existing techniques in error detection precision and recall.
It works with black-box LLM APIs that do not expose logprobs, and it requires no retraining.
The paper introduces one of the first public benchmarks for LLM structured output quality.
Abstract
Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI deployment. We present CONSTRUCT, a real-time uncertainty estimator that scores the trustworthiness of LLM Structured Outputs. Lower-scoring outputs are more likely to contain errors, enabling automatic prioritization of limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within a Structured Output, helping reviewers quickly identify which parts of the output are incorrect. Our method is suitable for any LLM (including black-box LLM APIs without logprobs), does not require labeled training data or custom model deployment, and supports complex Structured Outputs with heterogeneous fields and nested JSON schemas. We also introduce one of the first public LLM Structured Output benchmarks with reliable ground-truth values. Over this four-dataset benchmark,…
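To make the review-prioritization idea in the abstract concrete, here is a minimal sketch of how per-output and per-field trust scores might be used downstream. It does not show how CONSTRUCT computes its scores. The `ScoredOutput` class, the helper names, the 0.5 threshold, and the example scores are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredOutput:
    # Hypothetical container: one structured output plus its trust scores.
    doc_id: str
    overall_score: float            # trust score for the whole output, in [0, 1]
    field_scores: Dict[str, float]  # trust score per field, in [0, 1]


def review_queue(outputs: List[ScoredOutput], budget: int) -> List[ScoredOutput]:
    """Return the `budget` lowest-scoring outputs, least trustworthy first."""
    return sorted(outputs, key=lambda o: o.overall_score)[:budget]


def suspect_fields(output: ScoredOutput, threshold: float = 0.5) -> List[str]:
    """Return names of fields scoring below `threshold`, lowest score first."""
    flagged = [(score, name) for name, score in output.field_scores.items()
               if score < threshold]
    return [name for _, name in sorted(flagged)]


# Made-up scores for three extracted invoices (illustrative values only).
outputs = [
    ScoredOutput("invoice-1", 0.93, {"vendor": 0.97, "total": 0.91}),
    ScoredOutput("invoice-2", 0.31, {"vendor": 0.88, "total": 0.12}),
    ScoredOutput("invoice-3", 0.58, {"vendor": 0.42, "total": 0.80}),
]

# With capacity to review only two outputs, pick the two least trustworthy.
queue = review_queue(outputs, budget=2)
for item in queue:
    print(item.doc_id, suspect_fields(item))
```

Sorting by the overall score spends a fixed review budget on the outputs most likely to contain errors, and the field-level filter points the reviewer at the specific values to check first.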
