A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
Laurène Vaugrante, Mathias Niepert, Thilo Hagendorff

TL;DR
This paper highlights the risk of a replication crisis in LLM behavior research, demonstrating inconsistent results across prompt techniques and emphasizing the need for standardized evaluation methodologies.
Contribution
It provides empirical evidence of methodological weaknesses in current LLM evaluation practices and proposes solutions for more reliable assessment frameworks.
Findings
Most prompt engineering techniques showed no significant effect on LLM reasoning.
Current evaluation methods have notable methodological weaknesses.
The authors recommend developing robust, standardized evaluation protocols.
Abstract
In an era where large language models (LLMs) are increasingly integrated into a wide range of everyday applications, research into these models' behavior has surged. However, due to the novelty of the field, clear methodological guidelines are lacking. This raises concerns about the replicability and generalizability of insights gained from research on LLM behavior. In this study, we discuss the potential risk of a replication crisis and support our concerns with a series of replication experiments focused on prompt engineering techniques purported to influence reasoning abilities in LLMs. We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3-8B, and Llama 3-70B, on the chain-of-thought, EmotionPrompting, ExpertPrompting, Sandbagging, as well as Re-Reading prompt engineering techniques, using manually double-checked subsets of reasoning benchmarks including CommonsenseQA,…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Adam · Linear Layer · Residual Connection · Weight Decay
