Model Equality Testing: Which Model Is This API Serving?
Irena Gao, Percy Liang, and Carlos Guestrin

TL;DR
This paper introduces Model Equality Testing, a statistical method to detect if an API's model output distribution differs from a reference, revealing potential model modifications without user notification.
Contribution
It formalizes the problem as a two-sample test and demonstrates the effectiveness of MMD-based tests, applied to real-world APIs to identify distributional differences.
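As a rough guide to what the two-sample formulation looks like, the standard definitions can be written as follows. The symbols are assumptions introduced here for illustration, not taken from the page: $P$ is the reference distribution, $Q$ is the API's output distribution, and $k$ is a kernel on strings.

```latex
% Hypotheses of the two-sample test (P: reference, Q: API; notation assumed)
H_0 : P = Q \qquad \text{vs.} \qquad H_1 : P \neq Q

% Squared Maximum Mean Discrepancy under kernel k (standard definition)
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\,[k(x, x')]
  - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\,[k(x, y)]
  + \mathbb{E}_{y, y' \sim Q}\,[k(y, y')]
```

In practice the expectations are replaced by sample averages, and the null hypothesis is rejected when the estimate is large relative to what would be expected if both samples came from the same distribution.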
Findings
An MMD-based test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of 10 samples per prompt.
Applied to commercial APIs, 11 out of 31 endpoints showed distributional differences.
Simple string kernels are effective for detecting model distortions.
Abstract
Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- possibly without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per…
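The testing procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a character-level Hamming kernel normalized by a fixed padded length, a single prompt, and a permutation test to obtain the p-value. The function names, the `PAD` symbol, and the default `length` and `n_perm` values are all hypothetical choices made for this example.

```python
import random
from itertools import combinations

PAD = "\0"  # hypothetical padding symbol, assumed absent from real completions


def hamming_kernel(a: str, b: str, length: int = 50) -> float:
    """Fraction of positions where a and b agree after padding/truncating to `length`."""
    a = a[:length].ljust(length, PAD)
    b = b[:length].ljust(length, PAD)
    return sum(ca == cb for ca, cb in zip(a, b)) / length


def mmd_squared(xs, ys, kernel=hamming_kernel) -> float:
    """Unbiased estimate of MMD^2 between two samples (each needs at least 2 items)."""
    def mean_within(zs):
        pairs = list(combinations(zs, 2))
        return sum(kernel(a, b) for a, b in pairs) / len(pairs)

    cross = sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
    return mean_within(xs) + mean_within(ys) - 2 * cross


def permutation_test(xs, ys, n_perm: int = 500, seed: int = 0) -> float:
    """Permutation p-value for H0: xs and ys are drawn from the same distribution."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_perm):
        # Reassign samples to the two groups at random and recompute the statistic.
        rng.shuffle(pooled)
        if mmd_squared(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy completions for one prompt; real use would sample from the API and a reference.
    reference = ["the cat sat"] * 5 + ["the cat ran"] * 5
    api = ["a dog barked"] * 10
    print("p-value:", permutation_test(reference, api))
```

A small p-value suggests the API's completions differ in distribution from the reference. Note that with a fixed padded length, short completions agree on every padding position, so the choice of `length` affects how sensitive the kernel is; a real test would also aggregate over many prompts rather than one.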
Peer Reviews
Decision·ICLR 2025 Poster
The paper is overall well written, organized and easy to follow. The application of two sample testing using the Maximum Mean Discrepancy kernel is novel for studying distributions coming from different models.
- I’m not convinced of the significance of this problem. Can this problem be solved through policy if users ask providers to disclose such changes? API providers could even be motivated to charge users differently based on the optimizations done to models. - Evaluations can be stronger: - The authors claim that the proposed method works using an average of 10 samples per prompt across 20-25 prompts. Since the paper relies on empirical analysis, I would love to see more analysis ba
1. This paper is well-organized and written, making it easy to understand. 2. The problem addressed is valuable, as APIs have indeed become one of the mainstream forms of LLM applications. 3. The method design is reasonable and concise, and the experiments appear to be effective.
1. The study primarily focuses on the LLaMA series models. Although this aligns with the paper’s emphasis, I recommend that the authors verify the generalizability of MMD across more models. 2. The computation cost of MMD appears to be somewhat high.
- The overall idea of providing more analytic tools for black-box models / APIs seems important and interesting. I think many people reliant on these black-box APIs are left at their mercy and have little means of understanding how different service providers could affect their system performance. - The public release of generated samples offers valuable data for community use and future API change tracking. - Providing a framework for analyzing APIs on non-classification based (i.e. difficult t
- One immediate problem I see is that acquiring the reference distribution on a user-defined task requires setting up the reference LM anyway, which the paper acknowledges is inconvenient or infeasible in many cases. That would make this method impossible in certain scenarios. - The Hamming kernel's effectiveness across different tasks raises concerns, particularly for open-ended tasks like creative generation where diverse outputs may be desirable. The significance tests may str
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research · Software System Performance and Reliability
Methods: LLaMA
