The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
Jikai Jin, Vasilis Syrgkanis

TL;DR
This paper proposes combining a large confounded observational log, a small randomized experiment, and an offline simulator to evaluate language models when the choice of model in logged data is confounded.
Contribution
It introduces a three-source design (observational log, randomized experiment, offline simulator) and an identification theorem showing that the randomized experiment and the simulator together are enough to recover causal model values.
Findings
No estimator family dominates across all regimes.
Performance depends on the amount of experimental supervision.
Aligning target reward with observational structure improves estimates.
Abstract
Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward,…
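The confounding problem described in the abstract can be illustrated with a toy simulation. This is a minimal sketch and not the paper's estimator: the two models "A" and "B", the single binary user factor `expert`, the selection probabilities, and all numeric values are hypothetical and chosen only to make the bias visible.

```python
import random

# Hypothetical ground truth: model "B" beats "A" by 0.1 for every user.
TRUE_BASE = {"A": 0.5, "B": 0.6}
# Hypothetical user-side factor: experts' interactions are judged higher
# under either model, and experts also prefer model "A" in the logs.
EXPERT_BONUS = 0.3


def simulate(n, randomized, seed=0):
    """Return the mean logged score per model over n simulated interactions."""
    rng = random.Random(seed)
    totals = {"A": 0.0, "B": 0.0}
    counts = {"A": 0, "B": 0}
    for _ in range(n):
        expert = rng.random() < 0.5
        if randomized:
            p_a = 0.5  # experiment overrides model choice with a coin flip
        else:
            p_a = 0.8 if expert else 0.2  # self-selection depends on the user
        model = "A" if rng.random() < p_a else "B"
        bonus = EXPERT_BONUS if expert else 0.0
        score = TRUE_BASE[model] + bonus + rng.gauss(0.0, 0.1)
        totals[model] += score
        counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}


obs = simulate(200_000, randomized=False)         # confounded log
exp = simulate(200_000, randomized=True, seed=1)  # randomized experiment

print("observational means:", obs)
print("randomized means:   ", exp)
```

Under these assumptions the raw observational means rank "A" above "B" because experts both pick "A" more often and receive higher scores, while the randomized means recover the true 0.1 advantage of "B".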
