Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study
Mingyang Song, Mao Zheng, Xuan Luo, Yue Pan

TL;DR
This study explores how many-shot in-context learning prompts can improve the reliability of large language models when used as evaluators, showing that specific prompt designs enhance evaluation consistency and accuracy.
Contribution
The paper introduces two many-shot ICL prompt templates, Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR), to mitigate biases in LLM evaluators, and demonstrates their effectiveness with GPT-4o.
Findings
GPT-4o performs better in many-shot regimes.
MSwR prompts outperform MSoR in evaluation tasks.
Increasing in-context examples improves evaluation quality.
Abstract
Utilizing Large Language Models (LLMs) as evaluators to assess the performance of LLMs has garnered attention. However, this kind of evaluation approach is affected by potential biases within LLMs, raising concerns about the accuracy and reliability of the evaluation results of LLMs. To address this problem, we propose and study two many-shot In-Context Learning (ICL) prompt templates to help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the former utilizes in-context examples with model-generated evaluation rationales as references, while the latter does not include these references. Using these prompt designs, we investigate the impact of increasing the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o,…
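The difference between the two templates described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pairwise-comparison layout, the `Demo` schema, the field labels, and the `build_prompt` helper are all hypothetical, and keeping the verdict in the MSoR variant is an assumption. The only element taken from the abstract is that MSwR includes model-generated evaluation rationales in the in-context examples and MSoR leaves them out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One in-context example (hypothetical schema, not from the paper)."""
    question: str
    answer_a: str
    answer_b: str
    verdict: str         # e.g. "A", "B", or "tie"
    rationale: str = ""  # model-generated evaluation rationale


def format_demo(demo: Demo, with_reference: bool) -> str:
    """Render one example; the rationale appears only in the MSwR style."""
    parts = [
        f"Question: {demo.question}",
        f"Answer A: {demo.answer_a}",
        f"Answer B: {demo.answer_b}",
    ]
    if with_reference:
        parts.append(f"Rationale: {demo.rationale}")
    # Assumption: the verdict is kept in both variants.
    parts.append(f"Verdict: {demo.verdict}")
    return "\n".join(parts)


def build_prompt(demos: List[Demo], question: str, answer_a: str,
                 answer_b: str, with_reference: bool) -> str:
    """Assemble a many-shot evaluation prompt ending with the open query."""
    header = "Compare the two answers and give a verdict (A, B, or tie)."
    shots = [format_demo(d, with_reference) for d in demos]
    query = "\n".join([
        f"Question: {question}",
        f"Answer A: {answer_a}",
        f"Answer B: {answer_b}",
        "Verdict:",
    ])
    return "\n\n".join([header, *shots, query])
```

Calling `build_prompt(demos, ..., with_reference=True)` yields the MSwR-style prompt and `with_reference=False` the MSoR-style one. Increasing the number of in-context examples, as studied in the paper, corresponds here to passing a longer `demos` list.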
Taxonomy
Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations
