TL;DR
This paper shows that probabilistic calibration in language models can be improved by fine-tuning on synthetic prompts that require sampling from mathematical distributions. Two Calibration Fine-Tuning variants, soft-target and hard-target, improve structured-sampling fidelity across 12 models spanning four families.
Contribution
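It introduces two Calibration Fine-Tuning variants: a soft-target method that trains on trie-derived next-token targets, and a hard-target method that trains on completions sampled from the target distribution. Both improve probabilistic calibration in language models.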
Findings
Both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings.
Hard-target fine-tuning excels in numeric sampling tasks.
Soft-target fine-tuning performs better on broad stochastic generation tasks.
Abstract
Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected…
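The difference between the two training signals described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the character-level `tokenize` function, the `<eos>` marker, the function names, and the toy distribution are all assumptions made for illustration. The soft-target function builds a prefix trie over the tokenized outcomes and, at each prefix, normalizes the probability mass of each child token by the mass of the prefix. The hard-target function simply draws completions from the same distribution.

```python
import random
from collections import defaultdict

EOS = "<eos>"  # hypothetical end-of-sequence marker, not from the paper


def tokenize(text):
    # Assumption: character-level tokens stand in for a real tokenizer.
    return list(text) + [EOS]


def soft_targets(dist):
    """Map each token prefix to a next-token distribution via a prefix trie.

    dist: dict mapping an output string to its target probability.
    """
    prefix_mass = defaultdict(float)                      # total mass under each prefix
    child_mass = defaultdict(lambda: defaultdict(float))  # mass per next token
    for text, p in dist.items():
        toks = tokenize(text)
        for i, tok in enumerate(toks):
            prefix = tuple(toks[:i])
            prefix_mass[prefix] += p
            child_mass[prefix][tok] += p
    # Conditional next-token probability = child mass / prefix mass.
    return {
        prefix: {tok: m / prefix_mass[prefix] for tok, m in children.items()}
        for prefix, children in child_mass.items()
    }


def hard_targets(dist, n, seed=0):
    """Draw n completions from the target distribution."""
    rng = random.Random(seed)
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)


# Toy target distribution over three outputs (illustrative values only).
target = {"1": 0.5, "10": 0.2, "2": 0.3}
soft = soft_targets(target)
samples = hard_targets(target, n=5)
```

In this toy case the empty prefix assigns the token `1` the combined mass of the outcomes `1` and `10`, which shows why next-token targets must be derived from the trie rather than read directly from the outcome probabilities. A soft-target loss would then compare the model's next-token distribution with these per-prefix targets, while a hard-target loss would use ordinary next-token training on the sampled strings.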
