Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Bryan E. Tuck, Rakesh M. Verma

TL;DR
This study evaluates how well large language models satisfy orthographic constraints in text generation, revealing cross-family differences, capacity effects, and systematic failures on words with unusual orthography.
Contribution
It provides a comprehensive cross-family evaluation of LLMs on orthographic constraints, analyzing performance gaps, capacity effects, and human-model alignment.
Findings
Cross-family performance gaps are larger than within-family scaling gains.
High-capacity models benefit more from increased thinking budget.
Models systematically fail on words with unusual orthography despite high human success.
Abstract
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-family evaluation remains limited. We evaluate 39 configurations spanning three model families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Cross-family differences produce substantially larger performance gaps (2.0-2.2x, F1 = 0.761 vs. 0.343) than parameter scaling within families (an 83% gain when scaling from 4B to 32B), and a partial-correlation analysis rules out tokenizer design as a confound for within-family scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade, so additional compute does not help consistently. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but…
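To make "character-level constraint satisfaction" and the reported F1 scores concrete, here is a minimal sketch of how a generated word list might be checked and scored. The puzzle format (an allowed letter set, one required letter, a minimum length), the names `Puzzle`, `satisfies`, and `set_f1`, and the toy word lists are all assumptions for illustration. The abstract does not specify the puzzle rules or the exact scoring procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Puzzle:
    """Hypothetical puzzle format; the paper's actual puzzles may differ."""
    allowed: frozenset  # letters a word may use
    required: str       # letter every word must contain
    min_len: int = 4    # assumed minimum word length


def satisfies(word: str, puzzle: Puzzle) -> bool:
    """Check the hard character-level constraints for a single word."""
    w = word.lower()
    return (
        len(w) >= puzzle.min_len
        and puzzle.required in w
        and set(w) <= puzzle.allowed
    )


def set_f1(predicted, gold) -> float:
    """Set-based F1 between predicted words and the reference answers."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for this example only.
puzzle = Puzzle(allowed=frozenset("aeglnt"), required="g")
gold = {"gate", "gnat", "tang", "angle"}
model_output = ["gate", "angle", "gave", "tan"]

valid = [w for w in model_output if satisfies(w, puzzle)]
print(valid)                       # words that pass every constraint
print(set_f1(model_output, gold))  # precision and recall are both 2/4 here
```

In this toy run, "gave" fails because it uses a letter outside the allowed set, and "tan" fails on both length and the required letter, so two of the four predictions match the reference list.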
