Benchmarking Cognitive Biases in Large Language Models as Evaluators
Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang

TL;DR
This paper measures cognitive biases in large language models used as evaluators. It finds significant biases and misalignment with human preferences, which calls into question their robustness for automatic evaluation.
Contribution
Introduces CoBBLEr, a benchmark for measuring six cognitive biases in LLM evaluation outputs, and assesses the biases and alignment issues across 15 LLMs.
Findings
LLMs exhibit strong cognitive biases in evaluation tasks.
Machine preferences are only 49.6% aligned with human preferences.
LLMs show biases such as egocentric bias, affecting their evaluation robustness.
Abstract
Large Language Models are cognitively biased judges. Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that…
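The egocentric bias mentioned in the abstract can be made concrete with a small sketch. This is a minimal illustration, not the paper's implementation: it assumes each pairwise judgment is stored as a hypothetical tuple `(evaluator, candidate_a, candidate_b, winner)`, and it reports, for each evaluator, the fraction of comparisons involving its own output in which it picked itself. The model names and judgments are made up.

```python
from collections import defaultdict


def egocentric_rate(comparisons):
    """Fraction of self-involving comparisons in which the evaluator chose itself.

    `comparisons` is an iterable of (evaluator, candidate_a, candidate_b, winner)
    tuples. This record layout is an assumption made for illustration.
    """
    involved = defaultdict(int)   # comparisons where the evaluator is a candidate
    self_wins = defaultdict(int)  # subset where the evaluator picked its own output
    for evaluator, cand_a, cand_b, winner in comparisons:
        if evaluator in (cand_a, cand_b):
            involved[evaluator] += 1
            if winner == evaluator:
                self_wins[evaluator] += 1
    return {model: self_wins[model] / involved[model] for model in involved}


# Hypothetical judgments; "star", "square" and "circle" are placeholder systems.
comparisons = [
    ("star", "star", "square", "star"),      # star prefers itself
    ("star", "star", "circle", "circle"),    # star prefers the other system
    ("square", "star", "square", "square"),  # square prefers itself
    ("square", "circle", "star", "star"),    # square is not a candidate: ignored
]
rates = egocentric_rate(comparisons)
print(rates)
```

A rate well above 0.5 would suggest self-preference, but only relative to how often the evaluator's outputs are actually better, which this sketch does not control for.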
Peer Reviews
Decision · Submitted to ICLR 2024
- I appreciate the proposed taxonomy of cognitive biases observable in the behavior of LLM evaluators. A comprehensive measurement of these biases is crucial for fair and reliable LLM evaluation.
- The paper is well-organized and well-written. The figures offer effective visualizations of the experiment pipeline and results. The literature review adequately covers recent relevant works on (meta-)evaluating LLMs.

- While one of the potential contributions of this paper is its comprehensive analysis of multiple cognitive biases, I believe that most of the biases discussed have been previously identified (see Sections 3.1 and 3.2). The introduction of compassion fade, egocentric bias, and bandwagon effect may bring novelty, yet I have a few reservations:
  * Regarding compassion fade, the models might be unfamiliar with the names of other models, as their training corpus likely doesn't include information
1. The authors' contribution in proposing a cognitive bias benchmark for evaluating the quality and reliability of Language Model Evaluators (LLMs) is highly valuable for the research community.
2. The study effectively analyzes six different biases, presenting interesting findings. Specifically, the observation that most of the models strongly exhibit several biases, coupled with the low agreement between machine and human preferences, sheds light on the differences between automated and human

1. In this work, the authors primarily focus on pairwise evaluation based on the coherence criterion, without considering other evaluation formats, such as single-document evaluation and interactive evaluation.
2. As recommended in prior research (Wu & Aji, 2023), it is important to evaluate machine-generated text from various perspectives rather than depending solely on a single unified measure. It would be better to explore more diverse evaluation settings to ensure a comprehensive assessmen
* Benchmarking cognitive biases in LLM assessment is a very important topic, because recent studies have used LLMs extensively for judgment without being aware of the limitations of their capabilities. This paper provides a comprehensive investigation of 6 cognitive biases, including the new benchmark and a detailed analysis of 15 popular LLMs.
* It provides several interesting insights into cognitive biases in LLMs at different scales. For example, larger models prefer the long response more th

* The main weakness is that the number of instructions is small: only 50 question-answering instances. This affects the reliability of the conclusions, since the data points are limited. It would be better to conduct significance tests and report p-values for the results.
* The experiments do not consider ties in the pairwise evaluation, which may affect the conclusions. For example, if the responses of two systems are very similar, it is fine to choose either of them.
* For human evaluation, the average RB
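The reviewer's request for significance tests on a 50-instance sample can be sketched as an exact two-sided binomial test against chance. This is an illustrative sketch only: it assumes the comparisons are independent, that ties are excluded, and that the null hypothesis is a 50/50 preference. The function name and the counts used below are hypothetical and are not results from the paper.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null success probability `p`.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so floating-point ties are counted as ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical example: an evaluator picks the first-listed response
# in 32 of 50 comparisons. Is that distinguishable from chance?
p_value = binomial_two_sided_p(32, 50)
print(f"p-value: {p_value:.4f}")
```

With only 50 instances, fairly large deviations from 50% are needed before such a test rejects chance, which is the reviewer's concern about the reliability of the conclusions.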
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
