Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length
Zhiyu Xu, Jia Liu, Yixin Wang, Yuqi Gu

TL;DR
This paper introduces Latency-Response Theory (LaRT), an evaluation framework for large language models that jointly models response accuracy and chain-of-thought length, and reports more precise ability estimates than accuracy-only Item Response Theory (IRT).
Contribution
LaRT extends Item Response Theory (IRT) by incorporating response latency, measured as chain-of-thought length, and by modeling the correlation between latent reasoning ability and latent speed. The authors report improved estimation accuracy relative to IRT.
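To make the idea of a joint accuracy-and-length model concrete, here is a minimal simulation sketch. It is not the paper's specification: it assumes a two-parameter logistic (2PL) model for correctness, a lognormal model for chain-of-thought length, and a standard bivariate normal for latent ability and latent speed with correlation rho. The function name simulate_joint, all item parameters, and the sign convention (higher speed means shorter output) are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_joint(n_models=2000, n_items=40, rho=-0.6, seed=0):
    """Simulate accuracy and log CoT length from an ASSUMED joint latent model.

    Assumptions (not taken from the paper): 2PL accuracy, lognormal length,
    and (ability, speed) drawn from a bivariate normal with correlation rho.
    """
    rng = np.random.default_rng(seed)

    # Latent ability (theta) and latent speed (tau), unit variances, correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    latent = rng.multivariate_normal(np.zeros(2), cov, size=n_models)
    theta, tau = latent[:, 0], latent[:, 1]

    # Hypothetical item parameters.
    a = rng.uniform(0.8, 2.0, size=n_items)    # discrimination
    b = rng.normal(0.0, 1.0, size=n_items)     # difficulty
    beta = rng.normal(6.0, 0.5, size=n_items)  # baseline log-length of each item
    sigma = 0.3                                # residual sd of log-length

    # 2PL accuracy: P(correct) = sigmoid(a_j * (theta_i - b_j)).
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    correct = rng.binomial(1, p)

    # Lognormal length: log T_ij = beta_j - tau_i + noise (faster models write less).
    noise = rng.normal(0.0, sigma, size=(n_models, n_items))
    log_len = beta - tau[:, None] + noise

    return theta, tau, correct, log_len


if __name__ == "__main__":
    theta, tau, correct, log_len = simulate_joint()
    print("sample corr(ability, speed):", np.corrcoef(theta, tau)[0, 1])
    print("corr(mean accuracy, mean log length):",
          np.corrcoef(correct.mean(axis=1), log_len.mean(axis=1))[0, 1])
```

Under these assumptions, a negative rho means that higher-ability models tend to be slower, so their chain-of-thought outputs are longer on average. That dependence is what allows length data to carry information about ability beyond what accuracy alone provides.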
Findings
LaRT outperforms IRT in estimation accuracy and confidence interval precision.
A strong negative correlation between ability and speed is observed across benchmarks.
LaRT provides different and more reliable LLM rankings than IRT.
Abstract
The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to guide downstream applications and actionable future improvements. Item Response Theory (IRT) has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain-of-thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose Latency-Response Theory (LaRT) to jointly model the response accuracy and CoT length by introducing the latent ability, latent speed, and a key correlation parameter between them. We derive an efficient estimation algorithm and establish rigorous identifiability results for the population parameters to ensure the statistical validity of estimation. Theoretical asymptotic analyses and simulation studies…
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Psychometric Methodologies and Testing
