Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao

TL;DR
This paper introduces a method for making large language model evaluation more trustworthy: it identifies and suppresses shortcut neurons, which reduces the effect of data contamination on reported scores. The resulting evaluation results correlate strongly with the MixEval benchmark.
Contribution
It proposes a new approach using shortcut neuron analysis and patching to mitigate contamination in LLM evaluation, enhancing reliability without building new benchmarks.
Findings
High correlation ($\rho > 0.95$) with the MixEval benchmark
Effective suppression of shortcut neurons improves evaluation accuracy
Method generalizes across different benchmarks and settings
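To make the correlation finding concrete, here is a minimal sketch of how a rank correlation between two sets of model scores can be computed. It assumes that $\rho$ denotes Spearman's rank correlation, which this summary does not state explicitly. The function names and the toy score lists are hypothetical and are not taken from the paper.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest) upward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical scores for five models under two evaluations (illustrative only).
eval_a = [1.0, 2.0, 3.0, 4.0, 5.0]
eval_b = [1.0, 2.0, 3.0, 5.0, 4.0]  # the two strongest models swap places
rho = spearman(eval_a, eval_b)
```

A value near 1 means the two evaluations order the models almost identically, which is the sense in which agreement with a reference benchmark is being reported.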
Abstract
The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate…
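The general idea of patching neurons can be illustrated on a toy network. The sketch below is not the authors' implementation: it assumes that suppression means overwriting the activations of the chosen hidden neurons with the activations a reference (uncontaminated) model produces on the same input. The one-layer ReLU network, the weights, and every function name are hypothetical.

```python
def mlp_forward(x, w_in, w_out, patch=None):
    """One hidden ReLU layer followed by a linear readout.

    w_in:  one weight vector per hidden neuron (each the length of x).
    w_out: one weight vector per output unit (each the length of the hidden layer).
    patch: optional dict {hidden index: activation} that overwrites activations.
    """
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, neuron))) for neuron in w_in]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value  # replace the activation of a selected neuron
    output = [sum(h * w for h, w in zip(hidden, row)) for row in w_out]
    return hidden, output

def patched_output(x, reference, contaminated, shortcut_neurons):
    """Run the contaminated model with selected neurons patched from the reference."""
    ref_hidden, _ = mlp_forward(x, *reference)
    patch = {i: ref_hidden[i] for i in shortcut_neurons}
    _, output = mlp_forward(x, *contaminated, patch=patch)
    return output

# Toy models: neuron 1 of the "contaminated" model has an inflated input weight,
# standing in for a shortcut acquired during training (assumption for illustration).
reference = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
contaminated = ([[1.0, 0.0], [0.0, 5.0]], [[1.0, 1.0]])
x = [1.0, 2.0]

unpatched = patched_output(x, reference, contaminated, [])   # no neurons patched
patched = patched_output(x, reference, contaminated, [1])    # patch neuron 1 only
```

In this toy setting, patching neuron 1 brings the contaminated model's output back to the reference model's output, while leaving every other neuron untouched. How the shortcut neurons are actually selected (the comparative and causal analysis named in the abstract) is outside the scope of this sketch.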
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
