Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities
Thomas Ball, Shuo Chen, Cormac Herley

TL;DR
This paper measures GPT-4's accuracy on deterministic tasks and finds that small changes to the prompt or the input data can shift results significantly, which illustrates the pitfalls of the fixed-effect fallacy in LLM evaluation.
Contribution
The paper demonstrates that LLM performance varies with prompt and input choices, which challenges assumptions about model capabilities and about the reliability of current evaluation methods.
Findings
Performance varies with prompt phrasing and input composition.
Small input modifications can cause large performance differences.
Fixed-effect fallacy leads to unreliable generalizations in LLM assessment.
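To make the idea of varying prompt phrasing and input composition concrete, here is a minimal sketch of how list-counting trials might be generated across conditions. The phrasings, item populations, list lengths, and function names are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random

# Hypothetical phrasings and item populations (assumptions, not the paper's).
PHRASINGS = [
    "How many times does '{target}' appear in this list? {items}",
    "Count the occurrences of '{target}' in the following list: {items}",
]
POPULATIONS = {
    "fruit": ["apple", "pear"],
    "digits": ["0", "1"],
}


def make_trial(rng, phrasing, population, length):
    """Draw one random list and return (prompt, expected_count)."""
    items = [rng.choice(population) for _ in range(length)]
    target = population[0]  # count the first item of the population
    prompt = phrasing.format(target=target, items=", ".join(items))
    return prompt, items.count(target)


def make_conditions(seed=0, lengths=(10, 20), trials_per_condition=3):
    """Build trials for every (phrasing, population, length) combination."""
    rng = random.Random(seed)
    conditions = {}
    for p_idx, phrasing in enumerate(PHRASINGS):
        for pop_name, population in POPULATIONS.items():
            for length in lengths:
                key = (p_idx, pop_name, length)
                conditions[key] = [
                    make_trial(rng, phrasing, population, length)
                    for _ in range(trials_per_condition)
                ]
    return conditions
```

Each key identifies one condition, so accuracy can be tallied per condition rather than pooled, which is what allows differences between phrasings or populations to show up.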
Abstract
In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input parameter some element drawn from a large well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc). We examine several conditions per-task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task-accuracy both to query phrasing and input parameter population. We find that seemingly trivial modifications in the task-prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query-phrasing and list-length, but also with list composition (i.e., the thing-to-be-counted) and…
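One common way to check whether two conditions differ by more than sampling noise is a two-proportion z-test on the counts of correct answers. The sketch below is an assumption about how such a comparison could be done; the abstract does not say which test the authors use, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two accuracy rates.

    Returns (z, p_value). Uses the pooled-proportion standard error and the
    normal approximation, so it is only reasonable for fairly large n.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both conditions are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value for two conditions that differ only in phrasing or list composition would indicate that the accuracy gap is unlikely to be a sampling effect.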
Taxonomy
Topics: Diverse Scientific and Economic Studies · Probability and Statistical Research · Credit Risk and Financial Regulations
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer
