TL;DR
This paper introduces a rank-based uniformity test that checks whether the responses of a black-box LLM API match those of a locally deployed authentic model, so that model substitutions can be detected with few queries.
Contribution
It presents a query-efficient statistical test that detects model substitutions and modifications in black-box LLM APIs without producing query patterns the provider could detect.
Findings
The method accurately detects quantization and fine-tuning modifications.
It outperforms prior methods in statistical power under limited queries.
The approach is robust against adversarial response rerouting or mixing.
Abstract
As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including…
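The rank-based uniformity idea in the abstract can be pictured with a toy sketch. This is not the paper's implementation: it assumes each response can be reduced to a scalar score (for example, a log-likelihood under the local model), uses Gaussian numbers in place of real LLM scores, and picks a Kolmogorov-Smirnov test with an asymptotic p-value as the uniformity check, since the abstract does not say which test the authors use. The names `rank_percentile`, `ks_uniform`, and `simulate` are hypothetical. If the API serves the same model as the local reference, the rank of its score among local samples should be uniformly distributed; a substituted model skews the ranks.

```python
import math
import random


def rank_percentile(api_score, local_scores, rng):
    """Randomized rank percentile of one API score among local reference scores."""
    below = sum(s < api_score for s in local_scores)
    ties = sum(s == api_score for s in local_scores)
    # Spread the rank uniformly over its slot so the null distribution is U(0, 1).
    return (below + rng.random() * (ties + 1)) / (len(local_scores) + 1)


def ks_uniform(percentiles):
    """Kolmogorov-Smirnov statistic against U(0, 1) with an asymptotic p-value."""
    xs = sorted(percentiles)
    n = len(xs)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    # Small-sample correction followed by the Kolmogorov series (an approximation).
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, 2.0 * series))


def simulate(shift, n_prompts=500, m_local=20, seed=0):
    """Toy audit: `shift` moves the API's score distribution away from the local one."""
    rng = random.Random(seed)
    percentiles = []
    for _ in range(n_prompts):
        # Assumption: scores of local samples for this prompt are N(0, 1).
        local = [rng.gauss(0.0, 1.0) for _ in range(m_local)]
        # One API response per prompt; shift = 0 means the API is authentic.
        api = rng.gauss(shift, 1.0)
        percentiles.append(rank_percentile(api, local, rng))
    return ks_uniform(percentiles)


if __name__ == "__main__":
    print("authentic  :", simulate(0.0))
    print("substituted:", simulate(1.0))
```

Because each prompt is sent to the API only once in this sketch, the queries look like ordinary traffic, which is the property the abstract highlights; the statistical work happens on the local side.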