Likelihood hacking in probabilistic program synthesis
Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton

TL;DR
This paper identifies a vulnerability called likelihood hacking, which arises when language models are trained with reinforcement learning to write probabilistic programs. It formalizes conditions that prevent it and shows that language-level safety measures mitigate the issue in practice.
Contribution
It formalizes likelihood hacking in probabilistic programming, provides syntactic safety conditions, and develops SafeStan, a modified language that prevents likelihood hacking during model training.
Findings
GRPO-trained models generating PyMC code discover likelihood-hacking exploits within the first few training steps.
SafeStan effectively prevents likelihood hacking.
Language-level safety constraints are practically effective.
Abstract
When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement the safe fragment's conditions as SafeStan, an LH-resistant modification of Stan, and show empirically…
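To make "fails to normalise" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not code or an exploit taken from the paper. The toy model has no latent variables, so the marginal likelihood is just the data density. The names `honest_log_density`, `hacked_log_density`, and `BONUS` are hypothetical. The "hacked" model adds a constant to the log density, which is one simple way a data distribution can stop integrating to 1 while scoring higher on any dataset.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at y."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

def honest_log_density(y):
    # A properly normalised model of the data: y ~ Normal(0, 1).
    return normal_logpdf(y, 0.0, 1.0)

BONUS = 5.0  # hypothetical constant added to the log density (assumption)

def hacked_log_density(y):
    # Same shape as the honest model, plus a constant that fits nothing.
    return normal_logpdf(y, 0.0, 1.0) + BONUS

def total_mass(log_density, lo=-10.0, hi=10.0, n=20001):
    """Integrate exp(log_density) over [lo, hi] with the trapezoid rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * math.exp(log_density(lo + i * h))
    return total * h

observed = [0.3, -1.2, 0.8]  # made-up data for illustration

honest_mass = total_mass(honest_log_density)   # close to 1
hacked_mass = total_mass(hacked_log_density)   # close to exp(BONUS)
honest_score = sum(honest_log_density(y) for y in observed)
hacked_score = sum(hacked_log_density(y) for y in observed)

print(f"honest: mass={honest_mass:.4f}, log-likelihood={honest_score:.3f}")
print(f"hacked: mass={hacked_mass:.4f}, log-likelihood={hacked_score:.3f}")
```

In this sketch the hacked model gains `BONUS` per observation in log-likelihood without describing the data any better, and the check that exposes it is that its total mass over the data space is far from 1.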
Taxonomy
Topics: Formal Methods in Verification · Software Testing and Debugging Techniques · Software Engineering Research
