When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

Cosimo Galeone; Minsu Park; Giuseppe Ettorre; Daniele Ligorio

arXiv:2605.02363·cs.CL·May 5, 2026

TL;DR

This paper studies the gap between answer correctness and format compliance in the outputs of small language models. It proposes an iterative prompt optimization method that improves structured output reliability without model fine-tuning.

Contribution

We introduce AloLab, an iterative prompt optimizer that enhances structured output accuracy in small language models using only black-box API access, outperforming static prompts and constrained decoding.

Findings

01

AloLab achieves 84-87% output accuracy on GSM8K

02

AloLab reaches 34-40% accuracy on the MATH dataset

03

Meta-agent capability is crucial for optimization quality

Abstract

Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and…
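The distinction between task accuracy and output accuracy can be made concrete with a small sketch. The scoring rules below are assumptions for illustration, not the paper's evaluation code: the field names `reasoning` and `answer` are hypothetical (the abstract only says "JSON with required fields"), and task correctness is approximated by checking the last number that appears in the raw text.

```python
import json
import re

# Hypothetical output contract; the abstract does not name the required fields.
REQUIRED_FIELDS = ("reasoning", "answer")


def is_format_compliant(raw: str) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(key in obj for key in REQUIRED_FIELDS)


def task_correct(raw: str, gold: float) -> bool:
    """Lenient check (an assumption): the last number in the text equals gold."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw)
    return bool(numbers) and abs(float(numbers[-1]) - gold) < 1e-9


def output_correct(raw: str, gold: float) -> bool:
    """Joint event: valid JSON structure AND a correct value in the answer field."""
    if not is_format_compliant(raw):
        return False
    try:
        return abs(float(json.loads(raw)["answer"]) - gold) < 1e-9
    except (TypeError, ValueError):
        return False


# Three made-up responses to a question whose gold answer is 42.
samples = [
    ("The answer is 42.", 42.0),                    # correct, but not JSON
    ('{"reasoning": "6*7", "answer": 42}', 42.0),   # correct and compliant
    ('{"answer": 42}', 42.0),                       # correct, missing a field
]

task_acc = sum(task_correct(r, g) for r, g in samples) / len(samples)
output_acc = sum(output_correct(r, g) for r, g in samples) / len(samples)
print(f"task accuracy={task_acc:.2f}, output accuracy={output_acc:.2f}")
```

On these made-up samples all three responses count as task-correct, but only one satisfies the joint criterion. That is the kind of gap the abstract reports between NAIVE task accuracy and output accuracy.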
