Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance
Jon-Paul Cacioli

TL;DR
This study investigates whether below-chance performance in small language models reflects deliberate answer avoidance or positional bias, and finds that positional heuristics, not answer avoidance, account for the underperformance observed.
Contribution
The paper demonstrates that positional bias, not answer avoidance, explains below-chance performance in small LLMs, proposing positional distribution shift as a better detection method.
Findings
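Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, retaining 62-88% response identity with the honest baseline.
Llama-3-8B complied substantially, but its underperformance took the form of a positional heuristic rather than answer avoidance.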
Explicit anti-task instructions can induce below-chance accuracy, but not answer-aware avoidance.
Abstract
Detecting sandbagging, deliberate underperformance on capability evaluations, is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of 12 model-domain cells showed significant below-chance performance under sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, with 62-88% response identity with the honest baseline. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic,…
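The below-chance logic borrowed from symptom validity testing can be sketched as a one-sided exact binomial test: if a respondent scores reliably worse than random guessing on forced-choice items, they must know the right answers in order to avoid them. The sketch below assumes 10 answer options per item (so chance is 0.1) and uses hypothetical function names; the paper's exact test and any multiple-comparison correction are not given in this abstract.

```python
from math import comb

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def below_chance_p(correct, n_items, n_options):
    """One-sided p-value for scoring at or below `correct` under pure guessing.
    Assumption: every item has `n_options` equally likely choices."""
    chance = 1.0 / n_options
    return binom_cdf(correct, n_items, chance)

# One model-domain cell has 500 items. With an assumed 10 options,
# guessing yields about 50 correct on average.
p_at_chance = below_chance_p(50, 500, 10)   # not significant
p_well_below = below_chance_p(25, 500, 10)  # far below chance
print(p_at_chance, p_well_below)
```

A small p-value flags a cell as significantly below chance. In the pilot described above, none of the 12 model-domain cells reached that threshold under the sandbagging instruction.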
