Log Probability Tracking of LLM APIs

Timothée Chauvin; Erwan Le Merrer; François Taïani; Gilles Tredan

arXiv:2512.03816·cs.LG·March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper presents a cost-effective method for continuously monitoring LLM API consistency by analyzing log probabilities, capable of detecting small model updates more efficiently than existing approaches.

Contribution

It introduces a simple statistical test using token log probabilities for sensitive, low-cost API auditing and the TinyChange benchmark for evaluating audit sensitivity.

Findings

01

Detects one-step fine-tuning changes

02

More sensitive than existing methods

03

1000x cheaper for continuous monitoring

Abstract

When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange…
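The test described in the abstract can be illustrated with a minimal sketch. It assumes each API request returns the logprobs of a fixed set of candidate first tokens, and it compares the per-token average logprobs of a reference batch and a current batch with a permutation test. The statistic used here (mean absolute difference of per-token averages), the function names, and the synthetic data are all illustrative assumptions, not the authors' exact procedure; no real API is called.

```python
import numpy as np

def mean_logprob_shift(a, b):
    """Assumed statistic: mean absolute difference of per-token average logprobs."""
    return float(np.mean(np.abs(a.mean(axis=0) - b.mean(axis=0))))

def permutation_test(ref, cur, n_perm=2000, seed=0):
    """Return (statistic, p_value) under the null 'ref and cur come from the same model'.

    ref, cur: arrays of shape (n_samples, n_tokens). Each row holds the logprobs
    returned by one single-token request for a fixed set of candidate tokens.
    """
    rng = np.random.default_rng(seed)
    ref = np.asarray(ref, dtype=float)
    cur = np.asarray(cur, dtype=float)
    observed = mean_logprob_shift(ref, cur)

    pooled = np.concatenate([ref, cur], axis=0)
    n_ref = len(ref)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign samples to the two groups and recompute the statistic.
        perm = rng.permutation(len(pooled))
        stat = mean_logprob_shift(pooled[perm[:n_ref]], pooled[perm[n_ref:]])
        hits += int(stat >= observed)

    # Add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic stand-in for API responses: small Gaussian noise models
# the non-determinism of served logprobs (an assumption for this demo).
rng = np.random.default_rng(42)
base = np.array([-0.2, -2.1, -3.5, -4.0])    # hypothetical logprobs of 4 tokens
shift = np.array([0.05, -0.05, 0.0, 0.02])   # hypothetical effect of a model update
noise = 0.01

ref = base + rng.normal(0.0, noise, size=(30, 4))
same = base + rng.normal(0.0, noise, size=(30, 4))
shifted = base + shift + rng.normal(0.0, noise, size=(30, 4))

stat_same, p_same = permutation_test(ref, same)
stat_shifted, p_shifted = permutation_test(ref, shifted)
print(f"unchanged model: stat={stat_same:.4f}, p={p_same:.4f}")
print(f"updated model:   stat={stat_shifted:.4f}, p={p_shifted:.4f}")
```

A permutation test fits this setting because it makes no distributional assumption about the logprob noise: it only asks whether the observed shift is larger than what random regrouping of the same samples would produce.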

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

LT drastically reduces the cost of continuous monitoring, achieving sensitivity gains at a cost up to three orders of magnitude lower than competing state-of-the-art methods. LT provides substantially higher discriminative power and sensitivity than existing approaches. It reliably detects small modifications such as a single step of fine-tuning and demonstrates detection performance for weight pruning at an amplitude 512 times smaller than MET. The authors use permutation tests on th

Weaknesses

The entire methodology is contingent on the API provider supporting and returning log probabilities. Data presented in the paper indicates that only 23% of reachable endpoints on OpenRouter support this, which limits the applicability of the approach. LLM providers can obstruct LT by requiring minimum output token lengths. The reliance on log probabilities for only the first output token might miss certain modifications, such as adjusting the generation-length parameter.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Addresses an important and underexplored reproducibility problem: LLM APIs are increasingly widely used while being continuously updated, and monitoring these behavior shifts is an important topic.
- Simple, cost-efficient, and elegant approach: The proposed technique is easy to understand and implement: it basically tests whether the distribution of the log probability of the first generated token has changed, using a permutation test. This is neat, and also cost-efficient, as it only requires

Weaknesses

- Depends on APIs exposing logprobs: As the authors also note, only a small fraction of existing API providers (~23% on OpenRouter) offers logprob access.
- Detection without direction: The proposed method only detects whether a change occurred, not what changed or what the change looked like. In particular, it is unclear whether the change leads to better responses to user queries, or what kinds of biases or skills were introduced or forgotten in the model update. In practice, this is often more impor

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The method is very simple to implement when logprobs are available.
- The result that the detection AUC is not affected by prompt length is somewhat interesting.
- The authors release a benchmark for evaluating methods like this one.
- I like the plots showing how logprobs evolve over time.

Weaknesses

- There is limited technical novelty in the methodology. Checking for differences in logprobs is the de facto approach for checking the correctness of language model implementations and APIs (*e.g.*, from the vLLM tests: https://github.com/vllm-project/vllm/blob/66a168a197ba214a5b70a74fa2e713c9eeb3251a/tests/models/utils.py#L90). It is well known that this is a more sensitive test than simply checking for text equality. That this is more effective than just checking text outputs is not a surprisin

Taxonomy

Topics: Software System Performance and Reliability · Software Engineering Research · Software Testing and Debugging Techniques