Model Equality Testing: Which Model Is This API Serving?

Irena Gao, Percy Liang, and Carlos Guestrin

arXiv:2410.20247·cs.LG·April 10, 2025

TL;DR

This paper introduces Model Equality Testing, a statistical method for detecting whether the output distribution of a model served by an API differs from that of a reference, which reveals modifications the provider may have made without notifying users.

Contribution

The paper formalizes the problem as a two-sample test, shows that tests based on the Maximum Mean Discrepancy (MMD) are powerful for this task, and applies them to commercial APIs to identify distributional differences.

Findings

01

An MMD-based test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of 10 samples per prompt.

02

Applied to commercial APIs, the test found that 11 out of 31 endpoints showed distributional differences from the reference.

03

Simple string kernels are effective for detecting model distortions.

Abstract

Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- possibly without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per…
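The two-sample test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a character-level Hamming kernel (the fraction of matching positions after padding to a fixed length), a biased MMD² estimate, a permutation test for the p-value, and a single prompt. The names `hamming_kernel`, `mmd2`, and `permutation_test`, along with the parameters `length`, `n_perm`, and `seed`, are hypothetical and chosen only for illustration.

```python
import random

def hamming_kernel(a: str, b: str, length: int = 20) -> float:
    """Fraction of positions where two strings agree after truncating or
    padding both to `length` characters (illustrative, character-level)."""
    a = a[:length].ljust(length, "\0")
    b = b[:length].ljust(length, "\0")
    return sum(x == y for x, y in zip(a, b)) / length

def mmd2(xs, ys, kernel) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) - 2 * mean_k(xs, ys) + mean_k(ys, ys)

def permutation_test(xs, ys, kernel, n_perm: int = 200, seed: int = 0) -> float:
    """p-value for H0 'both samples come from the same distribution',
    obtained by reshuffling the pooled samples and recomputing MMD^2."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, kernel)
    pooled = list(xs) + list(ys)
    n = len(xs)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:n], pooled[n:], kernel) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the p-value strictly positive.
    return (at_least_as_large + 1) / (n_perm + 1)

# Toy data standing in for completions sampled from an API and a reference.
api_samples = ["aaaa"] * 10
ref_samples = ["bbbb"] * 10
p_value = permutation_test(api_samples, ref_samples, hamming_kernel)
print("reject equality" if p_value < 0.05 else "fail to reject", p_value)
```

In practice the samples would be completions collected for the same prompt from the API and from the reference model. A small p-value suggests the two output distributions differ. The paper works with many prompts at once, which this single-prompt sketch does not attempt to reproduce.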

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper is overall well written, organized and easy to follow. The application of two sample testing using the Maximum Mean Discrepancy kernel is novel for studying distributions coming from different models.

Weaknesses

- I’m not convinced of the significance of this problem. Can this problem be solved through policy, if users ask providers to disclose such changes? API providers could even be motivated to charge users differently based on the optimizations applied to the models.
- Evaluations can be stronger:
  - The authors claim that the proposed method works using an average of 10 samples per prompt across 20-25 prompts. Since the paper relies on empirical analysis, I would love to see more analysis ba

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper is well-organized and written, making it easy to understand.
2. The problem addressed is valuable, as APIs have indeed become one of the mainstream forms of LLM applications.
3. The method design is reasonable and concise, and the experiments appear to be effective.

Weaknesses

1. The study primarily focuses on the LLaMA series models. Although this aligns with the paper’s emphasis, I recommend that the authors verify the generalizability of MMD across more models.
2. The computation cost of MMD appears to be somewhat high.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The overall idea of providing more analytic tools for black-box models / APIs seems important and interesting. I think many people reliant on these black-box APIs are left at their mercy and have little means of understanding how different service providers could affect their system performance.
- The public release of generated samples offers valuable data for community use and future API change tracking.
- Providing a framework for analyzing APIs on non-classification based (i.e. difficult t

Weaknesses

- One immediate problem I see is that acquiring the reference distribution on a user-defined task requires setting up the reference LM anyway, which is acknowledged to be inconvenient or infeasible in many cases. That would make this method impossible in certain scenarios.
- The Hamming kernel's effectiveness across different tasks raises concerns, particularly for open-ended tasks like creative generation where diverse outputs may be desirable. The significance tests may str

Code & Models

Repositories

i-gao/model-equality-testing
PyTorch · Official

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Reliability and Analysis Research · Software System Performance and Reliability

Methods: LLaMA