Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

Mohammed Abu Baker; Luca Baroni; Dan Wilhelm

arXiv:2605.00994·cs.CL·May 5, 2026

TL;DR

This paper presents a perplexity-based method that identifies the finetuning objectives of large language models by exploiting their tendency to overgeneralize finetuned behaviors. The method does not require access to model internals.

Contribution

The authors introduce a simple, effective perplexity differencing technique that reveals the finetuning objectives of diverse models, including API-only models, without requiring access to model internals.

Findings

01. Method successfully reveals finetuning objectives across 76 diverse models.

02. Effective even without access to original pre-finetuning checkpoints.

03. Models trained on synthetic data or to produce specific phrases are highly susceptible.

Abstract

Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap between reference and finetuned models. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior. We evaluate this…
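The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that per-token log-probabilities for each completion have already been obtained from both the reference and the finetuned model, and it assumes the "perplexity gap" is the reference perplexity minus the finetuned perplexity; the abstract does not give the exact formula, and a ratio or log-difference would also be plausible. The function names `perplexity` and `rank_by_perplexity_gap` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one completion: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(
    completions: Sequence[str],
    ref_logprobs: Sequence[Sequence[float]],
    ft_logprobs: Sequence[Sequence[float]],
) -> List[Tuple[float, str]]:
    """Return (gap, completion) pairs sorted by decreasing gap.

    ASSUMPTION: gap = PPL_reference - PPL_finetuned. A large positive gap
    means the reference model finds the text surprising while the finetuned
    model does not, which is the signal the ranking looks for.
    """
    if not (len(completions) == len(ref_logprobs) == len(ft_logprobs)):
        raise ValueError("inputs must have the same length")
    scored = [
        (perplexity(ref) - perplexity(ft), text)
        for text, ref, ft in zip(completions, ref_logprobs, ft_logprobs)
    ]
    # Highest gap first: these are the completions to inspect.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Toy log-probabilities (natural log), invented purely for illustration.
    texts = ["completion A", "completion B", "completion C"]
    ref = [[-2.0, -2.0], [-1.0, -1.0], [-0.5]]
    ft = [[-1.0, -1.0], [-1.0, -1.0], [-1.5]]
    for gap, text in rank_by_perplexity_gap(texts, ref, ft):
        print(f"{gap:+.3f}  {text}")
```

In this toy input, completion A ranks first because the reference model assigns it much lower probability than the finetuned model does. Generating the completions from short random prefills, the first step in the abstract, is outside the scope of this sketch.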
