BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
Peng Lai, Zhihao Ou, Yong Wang, Longyue Wang, Jian Yang, Yun Chen, Guanhua Chen

TL;DR
BiasScope is a novel framework that automatically detects unknown biases in LLM-based evaluation methods, enhancing robustness and reliability by transforming bias discovery into an automated, comprehensive process.
Contribution
We introduce BiasScope, the first automated system for discovering potential unknown biases in LLM evaluations, and extend JudgeBench to JudgeBench-Pro for more rigorous robustness testing.
Findings
BiasScope effectively uncovers biases across model families.
Powerful LLM evaluators still have over 50% error rates on JudgeBench-Pro.
Automated bias discovery improves evaluation robustness.
Abstract
LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically discovering, at scale, potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery…
Peer Reviews
Decision: ICLR 2026 Poster
* Important research question: The work tackles a major open challenge in the “LLM-as-a-Judge” field: detecting unknown and latent biases, which directly affect fairness and reliability in automatic evaluation.
* Strong experimental coverage: Results are presented across seven target models (Table 1), multiple domains (math, reasoning, coding, knowledge), and ablations on teacher models (Table 2), validation timing (Table 3), and explanation depth (Table 4).
* Limited novelty: While the paper presents a well-engineered framework, I found that the most significant perturbation component has already been presented in CALM. The methodological novelty lies mainly in adding a search-based framework based on CALM. Consequently, the conceptual advancement, though practical and valuable, may be viewed as incremental rather than groundbreaking.
* Dependence on teacher quality: Table 2 shows strong teacher influence, yet the framework assumes the teacher its
Novel framework for automated bias discovery in LLM evaluators. Strong experimental evidence across multiple models and datasets. The new JudgeBench-Pro benchmark contributes valuable resources for future research.
While the contribution is significant, several limitations remain. The iterative bias discovery process is computationally demanding, restricting scalability for large evaluations. The framework depends heavily on the quality of the teacher model — if the teacher itself is biased, those biases may cascade into the discovery process. Additionally, the interpretability of the “discovered” biases is often shallow; many are validated statistically but not semantically explained, which reduces thei
* This work has a clear and effective definition of what it seeks to control, namely -- "Systematic, non-random tendencies exhibited by a Judge LLM during answer evaluation, which can lead its assessments to deviate from objective and equitable standards, thereby affecting the robustness and reliability of the evaluation". Their validation methodology (page 4, lines 184-198) operationalizes this: if injecting a bias into an incorrect response causes judges to choose it more often, that bias has
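The validation idea this review describes (inject a candidate bias into the incorrect response, then check whether the judge picks it more often) can be sketched as below. This is a minimal illustration, not the authors' implementation: the `Judge` interface, the `inject` function, the `min_increase` threshold, and the toy length-biased judge are all assumptions made for the example, and the sketch ignores response ordering effects.

```python
from typing import Callable, List, Sequence, Tuple

# Each pair is (question, correct_response, incorrect_response).
Pair = Tuple[str, str, str]
# Hypothetical judge interface: returns True when it prefers the FIRST response.
Judge = Callable[[str, str, str], bool]


def error_rate(judge: Judge, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the judge prefers the incorrect response."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    wrong = sum(1 for q, good, bad in pairs if not judge(q, good, bad))
    return wrong / len(pairs)


def bias_is_validated(
    judge: Judge,
    pairs: Sequence[Pair],
    inject: Callable[[str], str],
    min_increase: float = 0.0,  # assumed threshold, not taken from the paper
) -> bool:
    """Return True if injecting the bias raises the judge's error rate."""
    baseline = error_rate(judge, pairs)
    # Apply the candidate bias only to the incorrect response.
    perturbed: List[Pair] = [(q, good, inject(bad)) for q, good, bad in pairs]
    return error_rate(judge, perturbed) - baseline > min_increase


# Toy stand-ins (assumptions): a judge that prefers the longer response,
# and a candidate "verbosity" bias that pads the response.
def length_biased_judge(question: str, first: str, second: str) -> bool:
    return len(first) >= len(second)


def add_padding(response: str) -> str:
    return response + " " + "Furthermore, this is thoroughly justified. " * 3


pairs = [
    ("What is 2 + 2?", "The answer is 4.", "It is 5."),
    ("Capital of France?", "Paris is the capital.", "Lyon."),
]

if __name__ == "__main__":
    print(bias_is_validated(length_biased_judge, pairs, add_padding))
```

In a real setting the judge would wrap an LLM call and the injection would be produced by a teacher model; a statistical test over many pairs would likely be more appropriate than a fixed threshold.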
I could certainly be persuaded that this paper is ready for ICLR, but I think there are enough experimental gaps that I cannot fully endorse it as-is.
* All biased responses are generated by Qwen2.5-72B, which has its own errors, biases and preferences. For instance, Table 1 includes 4 Qwen models (Qwen2.5-1.5B, Qwen2.5-7B, Qwen2.5-14B, Qwen3-8B), so the authors may inject self-preference bias automatically alongside whatever biases they discover. At the very least, the authors should ablate th
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Imbalanced Data Classification Techniques · Topic Modeling
