# Human-anchored longitudinal comparison of generative AI with a bias-calibrated LLM-as-judge

**Authors:** Thomas Wiese

PMC · DOI: 10.1371/journal.pone.0339920 · PLOS One · 2026-02-02

## TL;DR

This study tracks three major model families over ten weekly waves, using blinded human raters and a bias-calibrated LLM-as-judge to show how their performance and behavior drift over time.

## Contribution

A preregistered, human-anchored longitudinal evaluation framework that pairs blinded human ratings with a bias-calibrated LLM-as-judge to track service drift in evolving AI models.

## Key findings

- The three model families showed divergent stability trajectories: one stable, one improving, and one degrading mid-study.
- LLM-as-judge calibration increased agreement with human raters (τ = 0.59–0.68) and reduced volatility.
- Safety metrics co-varied with drift events, suggesting behavioral shifts rather than confirmed causal changes.
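
The agreement figure in the findings above (τ = 0.59–0.68) is a rank correlation, but this summary does not say which variant the paper uses. As a rough illustration only, the sketch below assumes Kendall's τ-b and computes it between human and judge scores for the same items. The function name `kendall_tau_b` and the example scores are hypothetical and are not taken from the paper's data.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-aware).

    Assumption: the paper's tau is Kendall's; the summary does not confirm this.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    # Pairs untied in x, and pairs untied in y, respectively.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    denom = math.sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else math.nan


# Hypothetical scores for five responses (not from the study).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 2, 3, 5, 4]
print(kendall_tau_b(human_scores, judge_scores))  # 0.8: one of ten pairs is discordant
```

A value of 1 means the judge orders every pair of items the same way the human raters do, and a value of -1 means it reverses every pair.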

## Abstract

Service LLMs evolve without public changelogs, complicating reproducible evaluation. We present a preregistered human-anchored longitudinal study that tracks three major model families over ten weekly waves using a fixed prompt bank (N = 240) across six domains. Blinded human raters provided correctness judgments, and a bias-calibrated LLM-as-judge produced secondary pairwise preferences corrected weekly via a Bradley–Terry model. Mixed-effects modeling and change-point detection (PELT with MBIC penalty) identified significant service drift patterns. Results show divergent stability trajectories among models: one stable, one improving, and one degrading mid-study. Judge calibration increased agreement with humans (τ = 0.59–0.68) while reducing volatility. Safety metrics co-varied with drift events, suggesting behavioral shifts rather than confirmed causal changes. All data, prompts, rubrics, and parameter configurations are provided in supporting files S1–S6.
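
The abstract states that the judge's pairwise preferences are corrected weekly via a Bradley–Terry model, but this summary gives no details of that procedure. The sketch below shows only the generic building block: fitting Bradley–Terry strengths from a matrix of pairwise win counts with the standard minorization-maximization update. It does not implement the paper's bias calibration. The function names, the win matrix, and the iteration settings are all hypothetical.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1. Assumes every item wins at
    least once; otherwise its strength is driven toward zero.
    """
    k = len(wins)
    strength = [1.0 / k] * k
    for _ in range(n_iter):
        updated = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            # Comparisons involving i, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(k)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        norm = sum(updated)
        updated = [s / norm for s in updated]
        converged = max(abs(a - b) for a, b in zip(updated, strength)) < tol
        strength = updated
        if converged:
            break
    return strength


def win_probability(strength, i, j):
    """Model probability that item i is preferred over item j."""
    return strength[i] / (strength[i] + strength[j])


# Hypothetical one-week tallies for three models (not from the study):
# row i, column j = number of prompts where the judge preferred model i over j.
weekly_wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
strengths = fit_bradley_terry(weekly_wins)
print(strengths)
print(win_probability(strengths, 0, 1))
```

Refitting such a model on each weekly wave would give one strength estimate per model per week, which is the kind of series a drift analysis could then examine.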

## Full-text entities

- **Diseases:** toxicity (MESH:D064420)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12863567/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12863567/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC12863567/full.md

---
Source: https://tomesphere.com/paper/PMC12863567