Lost in Space: Finding the Right Tokens for Structured Output
Sil Hamilton, David Mimno

TL;DR
This paper investigates how different structured output formats affect language model performance on NLP tasks, finding that formats which follow convention and use tokens with leading whitespace improve accuracy and robustness.
Contribution
It systematically compares five output formats across four model families and four common NLP benchmarks, and provides best practices for using structured outputs in zero-shot classification.
Findings
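All models perform most accurately when guided to use formats respecting convention, such as letters for multiple choice and real numbers for numerical prediction.
Guiding models to return tokens that incorporate leading whitespace improves performance by 5%-10%, especially for smaller models.
Grammars that look semantically similar to humans can still produce systematically different downstream performance.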
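To make the leading-whitespace point concrete, here is a minimal sketch of one step of grammar-constrained decoding over a toy vocabulary. It is not the authors' implementation: the function name `constrained_choice` and every logit value are hypothetical, chosen only to show how two grammars that look equivalent to a human ("A" versus " A") can select different answers once the mask is applied.

```python
import math


def constrained_choice(logits, allowed):
    """Restrict next-token logits to `allowed` and return (best_token, probs).

    This mimics a single decoding step under a grammar: tokens outside the
    allowed set are masked out, and the remainder is renormalized.
    """
    masked = {tok: logits[tok] for tok in allowed if tok in logits}
    if not masked:
        raise ValueError("no allowed token appears in the vocabulary")
    # Numerically stable softmax over the surviving tokens only.
    peak = max(masked.values())
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical logits after a prompt ending in "Answer:" (invented numbers,
# not measurements from the paper).
toy_logits = {"A": 0.2, "B": 0.4, " A": 2.1, " B": 0.3, "The": 1.5}

# Grammar 1: bare letters. Grammar 2: letters with a leading space.
bare, bare_probs = constrained_choice(toy_logits, ["A", "B"])
spaced, spaced_probs = constrained_choice(toy_logits, [" A", " B"])

print(repr(bare), repr(spaced))
```

In this toy setup the model's preference for " A" is only visible to the grammar that admits whitespace-prefixed tokens; the bare-letter grammar has to choose between two low-scoring tokens and lands on a different label. Real tokenizers and models will behave differently, so treat this purely as an illustration of the mechanism.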
Abstract
General-purpose language models are trained to produce varied natural language outputs, but for some tasks, like annotation or classification, we need more specific output formats. LLM systems increasingly support structured output, which enforces formats by sampling tokens according to a grammar -- but also unpredictably reduces downstream performance. Are there systematic differences between grammars that appear semantically (and often visually) similar to humans? To answer this, we test four popular model families with five varying output formats on four common NLP benchmarks. We find all models perform most accurately when guided to use formats respecting convention, such as letters for multiple choice and real numbers for numerical prediction. Performance also improves by 5%-10% when guiding models to return tokens incorporating leading whitespace, with smaller models benefiting…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
