Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
Wenhua Nie, ZiCheng Zhu, Jianan Wu, Binhan Luo, Haoran Zheng, Jyh-Shing Roger Jang

TL;DR
This paper analyzes the limits of extracting distributional information from large language model APIs that only reveal top-$K$ logit scores, showing how much private capability can still be recovered despite censorship.
Contribution
It introduces a geometric framework for understanding the identified set of distributions under top-$K$ censoring and quantifies the recovery limits for distributional and KL divergence measures.
Findings
Top-$K$ distillation recovers 12% of private capability.
Full-logit distillation recovers 56% of private capability.
Generation-based extraction recovers 96% of private capability.
Abstract
Modern LLM APIs often reveal only top-$K$ logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For a given censoring threshold, the compatible teacher distributions form an identified set whose total-variation diameter admits an exact closed-form expression involving the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-$K$ distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-$K$ censoring therefore limits per-position distribution recovery…
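To make the access model concrete, the following is a minimal sketch of top-$K$ censoring at a single position and of the ambiguity it leaves about the teacher distribution. It is not the paper's construction: the function names `censor_top_k` and `observed_mass_bounds` are hypothetical, and the sketch assumes only that every hidden logit is at most the smallest revealed logit (the threshold) and may be arbitrarily small. Under that assumption, the total probability of the revealed tokens is pinned down only to an interval.

```python
import math


def censor_top_k(logits, k):
    """Return (indices, values) of the k largest logits; all others are hidden."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = order[:k]
    return top, [logits[i] for i in top]


def observed_mass_bounds(top_values, vocab_size):
    """Bounds on the total probability assigned to the revealed tokens.

    Assumption (illustrative, not taken from the paper): each hidden logit
    is at most the threshold tau = min(top_values) and has no lower limit.
    """
    k = len(top_values)
    tau = min(top_values)
    shift = max(top_values)  # subtract the max for numerical stability
    # Observed partition function, rescaled by exp(-shift).
    z_obs = sum(math.exp(v - shift) for v in top_values)
    # Largest hidden contribution: all censored logits sit at the threshold.
    hidden_max = (vocab_size - k) * math.exp(tau - shift)
    lower = z_obs / (z_obs + hidden_max)
    upper = 1.0  # supremum, approached as hidden logits tend to -infinity
    return lower, upper


if __name__ == "__main__":
    full_logits = [2.0, 1.0, 0.0, -1.0]
    indices, values = censor_top_k(full_logits, k=2)
    low, high = observed_mass_bounds(values, vocab_size=len(full_logits))
    print(indices, values, low, high)
```

Every distribution whose revealed-token mass lies in this interval and whose revealed logits match the API output is compatible with the observation, which is the sense in which the censored response identifies a set of teachers rather than a single one.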
