Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

Peiwen Yuan; Yiwei Li; Shaoxiong Feng; Xinglin Wang; Yueqi Zhang; Jiayi Shi; Chuyi Tan; Boyuan Pan; Yao Hu; Kan Li

arXiv:2505.20738·cs.CL·May 28, 2025

Open Access

TL;DR

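This paper identifies self-bias in LLM-generated benchmarks, meaning the inflated performance of a model evaluated on a benchmark it generated itself, and proposes Silencer, a framework that neutralizes this bias and improves evaluation effectiveness by exploiting the heterogeneity among multiple generators.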

Contribution

The paper introduces Silencer, a novel framework that neutralizes self-bias in LLM-generated benchmarks using heterogeneity among multiple generators.

Findings

01. Silencer suppresses self-bias to near zero.

02. Silencer significantly improves how well evaluation on the generated benchmarks correlates with evaluation on human benchmarks.

03. Silencer generalizes well across the settings tested.

Abstract

LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate a high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero and significantly improve the evaluation effectiveness of the generated benchmark (with an average…
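
To make the notion of self-bias concrete, here is a minimal Python sketch of one possible proxy: the gap between a model's accuracy on its own generated benchmark and its mean accuracy on benchmarks written by other generators. This proxy is an assumption for illustration and is not the paper's formal definition, which this page does not give. The function name `self_bias_gap`, the model names, and all scores are hypothetical.

```python
from statistics import mean


def self_bias_gap(acc):
    """Return, per model, own-benchmark accuracy minus mean accuracy elsewhere.

    `acc[model][generator]` is the accuracy of `model` on the benchmark
    written by `generator`. This is an illustrative proxy only (an
    assumption), not the definition used in the paper.
    """
    gaps = {}
    for model, row in acc.items():
        # Scores of this model on benchmarks generated by *other* models.
        others = [score for gen, score in row.items() if gen != model]
        gaps[model] = row[model] - mean(others)
    return gaps


# Made-up scores for three hypothetical models that each generate a benchmark.
acc = {
    "model_a": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.70},
    "model_b": {"model_a": 0.60, "model_b": 0.75, "model_c": 0.60},
    "model_c": {"model_a": 0.50, "model_b": 0.50, "model_c": 0.50},
}

gaps = self_bias_gap(acc)
for model, gap in gaps.items():
    print(f"{model}: {gap:+.2f}")
```

A positive gap is consistent with a model scoring higher on its own benchmark, but this crude proxy does not separate self-bias from differences in benchmark difficulty, so it should be read as a starting point rather than a measurement.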

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Simulation Techniques and Applications