The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
Xin Li, Hao Jiang, Annan Wang, Yichi Zhang, Chau Yuen

TL;DR
This paper analyzes the limits of on-policy distillation for structured outputs, deriving a threshold for safe extrapolation and demonstrating how operating near this boundary affects model performance and format adherence.
Contribution
The paper introduces a closed-form safety threshold for reward extrapolation in structured-output distillation and extends the analysis to JSON tasks, with empirical validation on Amazon Fashion.
Findings
The derived threshold predicts the extrapolation cliff accurately.
Operating just below the threshold achieves in-domain parity with fewer parameters.
Format adherence remains stable below the cliff, while validity sharply declines above it.
Abstract
On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can surpass the teacher in-domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form, base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability p, the warm-start mass b, and the importance-sampling clip strength c. Above lambda*, the extrapolated fixed point exits the clip-safe region, switching training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff…
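To make the role of lambda concrete, the sketch below illustrates one common form of base-relative extrapolation on a single two-outcome position. It is an assumption, not the paper's stated construction: the extrapolated target is taken to be proportional to b^(1-lambda) * p^lambda on the modal token, where p is the teacher modal probability and b the warm-start mass. The function name extrapolated_modal_prob is hypothetical. The abstract does not give the closed form of lambda*(p,b,c), and this sketch does not attempt to reconstruct it or the clip-safety condition.

```python
import math


def extrapolated_modal_prob(p: float, b: float, lam: float) -> float:
    """Modal-token probability of the assumed extrapolated target.

    ASSUMPTION (not taken from the paper): on a two-outcome position the
    target is q(mode) proportional to b**(1 - lam) * p**lam, with the
    complementary outcome weighted by (1 - b)**(1 - lam) * (1 - p)**lam.
    """
    if not (0.0 < p < 1.0 and 0.0 < b < 1.0):
        raise ValueError("p and b must lie strictly between 0 and 1")
    # Work in log space so large lam values do not underflow or overflow.
    log_mode = (1.0 - lam) * math.log(b) + lam * math.log(p)
    log_other = (1.0 - lam) * math.log(1.0 - b) + lam * math.log(1.0 - p)
    shift = max(log_mode, log_other)
    w_mode = math.exp(log_mode - shift)
    w_other = math.exp(log_other - shift)
    return w_mode / (w_mode + w_other)


if __name__ == "__main__":
    p, b = 0.9, 0.6  # illustrative values only, not from the paper
    for lam in (0.0, 1.0, 1.5, 2.0):
        q = extrapolated_modal_prob(p, b, lam)
        print(f"lambda={lam:.1f}  q(mode)={q:.4f}")
```

Under this assumed form, lambda = 0 returns the warm-start mass, lambda = 1 returns the teacher, and lambda > 1 pushes the modal probability beyond the teacher whenever p > b. That is the regime in which the abstract says a clip-safety threshold becomes relevant.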
