Evaluation of Large Language Models via Coupled Token Generation

Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, and Manuel Gomez-Rodriguez

arXiv:2502.01754 · cs.CL · March 26, 2026

TL;DR

This paper introduces a causal model for coupled token generation in large language models, enabling more reliable evaluation by controlling randomness, which reduces sample requirements and reveals potential biases in model rankings.

Contribution

We develop a causal framework for coupled autoregressive generation, demonstrating its advantages in evaluation efficiency and exposing confounding effects of randomness on model rankings.

Findings

1. Coupled generation reduces the number of samples needed for evaluation by up to 75%.

2. On benchmark datasets, coupled and vanilla generation lead to consistent evaluation results.

3. Model rankings can differ significantly under coupled versus vanilla generation.
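
The sample savings in the first finding can be read through a standard variance identity. What follows is an illustrative sketch, not the paper's derivation: S_A and S_B are hypothetical per-prompt scores of two models, rho is their correlation, and the equal-variance simplification is an assumption made only for this example.

```latex
% Variance of the score difference between two models
\operatorname{Var}(S_A - S_B)
  = \operatorname{Var}(S_A) + \operatorname{Var}(S_B) - 2\,\operatorname{Cov}(S_A, S_B)

% Vanilla (independent) sampling: Cov(S_A, S_B) = 0.
% Coupled sampling, assuming Var(S_A) = Var(S_B) = \sigma^2 and Cov(S_A, S_B) = \rho\,\sigma^2:
\frac{n_{\text{coupled}}}{n_{\text{vanilla}}}
  = \frac{2\sigma^2 (1 - \rho)}{2\sigma^2}
  = 1 - \rho
```

Under these assumptions, the number of samples needed to reach a given standard error shrinks in proportion to 1 - rho, so a correlation of 0.75 between coupled scores would correspond to a 75% reduction. The 75% figure is the one reported above; its mapping to a correlation is only a property of this sketch.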

Abstract

State-of-the-art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation…
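
To make "sampling with the same source of randomness" concrete, here is a minimal sketch of one standard way to couple two samplers, the Gumbel-max trick with a shared noise vector. It is an assumption of this example that the coupling is done this way and that both models share one vocabulary; the abstract above does not specify the construction. All function names and the toy logits are hypothetical.

```python
import math
import random

def gumbel_noise(rng, size):
    """Draw `size` independent standard Gumbel variates from `rng`."""
    # Clamp u away from 0 so that log(u) is always defined.
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(size)]

def gumbel_argmax(logits, noise):
    """Gumbel-max trick: argmax of (logits + Gumbel noise) is a softmax sample."""
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])

def coupled_next_tokens(logits_a, logits_b, rng):
    """Coupled sampling: both models reuse ONE noise vector."""
    noise = gumbel_noise(rng, len(logits_a))
    return gumbel_argmax(logits_a, noise), gumbel_argmax(logits_b, noise)

def independent_next_tokens(logits_a, logits_b, rng):
    """Vanilla baseline: each model draws its own noise vector."""
    noise_a = gumbel_noise(rng, len(logits_a))
    noise_b = gumbel_noise(rng, len(logits_b))
    return gumbel_argmax(logits_a, noise_a), gumbel_argmax(logits_b, noise_b)

def agreement_rate(sampler, logits_a, logits_b, n=2000, seed=0):
    """Fraction of n draws in which both models pick the same token."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        token_a, token_b = sampler(logits_a, logits_b, rng)
        hits += token_a == token_b
    return hits / n

# Hypothetical next-token logits over a shared 4-token vocabulary.
LOGITS_A = [2.0, 1.0, 0.5, -1.0]
LOGITS_B = [1.8, 1.1, 0.4, -0.9]

if __name__ == "__main__":
    print("coupled:    ", agreement_rate(coupled_next_tokens, LOGITS_A, LOGITS_B))
    print("independent:", agreement_rate(independent_next_tokens, LOGITS_A, LOGITS_B))
```

Each model still samples from its own softmax distribution in both modes; only the joint behaviour changes. With shared noise, two models with similar next-token distributions pick the same token far more often than with independent noise, so differences in their outputs reflect differences between the models rather than sampling luck.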

Code & Models

Repositories

networks-learning/coupled-llm-evaluation
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA