When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
Cosimo Galeone, Minsu Park, Giuseppe Ettorre, Daniele Ligorio

TL;DR
Small language models often produce answers that are correct but not format-compliant. This paper proposes an iterative prompt optimization method that improves structured-output reliability without model fine-tuning.
Contribution
We introduce AloLab, an iterative prompt optimizer that enhances structured output accuracy in small language models using only black-box API access, outperforming static prompts and constrained decoding.
Findings
AloLab achieves 84-87% output accuracy on GSM8K
AloLab reaches 34-40% accuracy on MATH
Meta-agent capability is crucial for optimization quality
Abstract
Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and…
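The gap between task accuracy and output accuracy described above can be made concrete with a minimal sketch. This is not the authors' evaluation code: the required field names (`reasoning`, `answer`), the "last number in the text" heuristic for task accuracy, and all function names are assumptions chosen for illustration only.

```python
import json
import re

# Hypothetical output contract: the paper requires JSON with required fields,
# but the exact field names here are assumed for illustration.
REQUIRED_FIELDS = ("reasoning", "answer")


def parse_structured(raw):
    """Return the parsed object if raw is a JSON object with all required fields, else None."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    if any(field not in obj for field in REQUIRED_FIELDS):
        return None
    return obj


def is_task_correct(raw, gold):
    """Assumed heuristic: the last number anywhere in the text matches the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw)
    return bool(numbers) and numbers[-1] == str(gold).strip()


def is_output_correct(raw, gold):
    """Joint event: valid structure AND a correct value in the answer field."""
    obj = parse_structured(raw)
    if obj is None:
        return False
    return str(obj["answer"]).strip() == str(gold).strip()


def accuracy(check, outputs, golds):
    """Fraction of (output, gold) pairs for which check returns True."""
    return sum(check(o, g) for o, g in zip(outputs, golds)) / len(golds)


outputs = [
    "The answer is 4.",                      # correct, but free text
    '{"reasoning": "2+2", "answer": "4"}',   # correct and compliant
    '{"answer": "4"}',                       # correct, missing a required field
]
golds = ["4", "4", "4"]

task_acc = accuracy(is_task_correct, outputs, golds)
output_acc = accuracy(is_output_correct, outputs, golds)
```

In this toy set every response contains the right number, so task accuracy is 1.0, yet only one of three responses satisfies the contract. That is the pattern the abstract reports for NAIVE prompting, where high task accuracy coexists with 0% output accuracy.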
