Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks

Yiliang Song; Hongjun An; Jiangan Chen; Xuanchen Yan; Huan Song; Jiawei Shao; Xuelong Li

arXiv:2603.21636 · cs.AI · March 31, 2026

TL;DR

This paper critiques the reliability of LLM benchmarks by analyzing contamination sensitivity and score confidence, proposing an audit framework to improve evaluation robustness.

Contribution

The paper introduces an audit framework that uses a router-worker setup to assess contamination effects and score confidence in LLM benchmarks.

Findings

01. Contamination-related memory reactivation is observed widely across models.

02. Models often score higher under noisy benchmark conditions than under the clean control, which indicates contamination influence.

03. The confidence a benchmark score warrants may vary significantly with contamination sensitivity.

Abstract

Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should…
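The clean-versus-noisy comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `delete_words`, `accuracy`, and `contamination_gap`, the word-level deletion noise, the `drop_rate` parameter, and exact-match scoring are all assumptions introduced here, and the worker is any hypothetical callable that maps a question string to an answer string. Only the deletion style of noise is shown; rewriting and perturbation would slot in as alternative routing functions.

```python
import random
from typing import Callable, Sequence, Tuple

Item = Tuple[str, str]  # (question, gold answer); hypothetical item format


def delete_words(text: str, drop_rate: float, rng: random.Random) -> str:
    """Router-side noise (assumed form): drop each word with probability drop_rate."""
    kept = [w for w in text.split() if rng.random() >= drop_rate]
    # Fall back to the original text if every word was dropped.
    return " ".join(kept) if kept else text


def accuracy(
    worker: Callable[[str], str],
    items: Sequence[Item],
    route: Callable[[str], str],
) -> float:
    """Exact-match accuracy of the worker on questions passed through `route`."""
    correct = sum(worker(route(question)) == gold for question, gold in items)
    return correct / len(items)


def contamination_gap(
    worker: Callable[[str], str],
    items: Sequence[Item],
    drop_rate: float = 0.3,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (clean accuracy, noisy accuracy, noisy minus clean)."""
    rng = random.Random(seed)  # seeded so the noisy condition is reproducible
    clean = accuracy(worker, items, lambda q: q)
    noisy = accuracy(worker, items, lambda q: delete_words(q, drop_rate, rng))
    return clean, noisy, noisy - clean


if __name__ == "__main__":
    # Toy worker that ignores its input; a real worker would call a model.
    toy_items = [("what is 2 + 2", "4"), ("what is 3 + 3", "6")]
    print(contamination_gap(lambda q: "4", toy_items))
```

A gap above zero means the worker scored higher on the noisy inputs than on the clean control, which is the pattern the findings associate with contamination influence.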
