Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
Jon-Paul Cacioli

TL;DR
Under prompted sandbagging, a model's response-position distribution stays almost unchanged when option content is cyclically rotated. This points to a soft, content-invariant positional attractor at the model level rather than deterministic item-level position tracking.
Contribution
The paper demonstrates that prompted sandbagging in language models leads to a stable, content-invariant response-position distribution, indicating a soft distributional attractor at the model level.
Findings
Response-position distribution remains highly stable under content rotation (Pearson r = 0.9994).
Accuracy peaks at 72.1% when the correct answer is in position E.
Qwen-2.5-7B shows no distributional shift, serving as a negative control.
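As a rough illustration of how the stability finding above could be measured, the sketch below builds a response-position distribution from a list of chosen letters and correlates two such distributions with Pearson's r. It is not the paper's analysis code: the function names, the ten-letter option set, and the input format are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt

# Assumption: up to ten options per item (A-J), as in MMLU-Pro.
LETTERS = "ABCDEFGHIJ"


def position_distribution(responses, letters=LETTERS):
    """Return the fraction of responses that fall on each option letter."""
    counts = Counter(responses)
    total = sum(counts[letter] for letter in letters)
    if total == 0:
        raise ValueError("no valid responses")
    return [counts[letter] / total for letter in letters]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In use, `position_distribution` would be applied once to the sandbagging responses under the original option order and once under the rotated order, and the two resulting vectors passed to `pearson_r`. A value near 1 means the model chose each position at nearly the same rate regardless of which content sat there.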
Abstract
A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions).…
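The controls named in the abstract can be sketched in a few lines. The code below shows one possible form of cyclic option rotation, the item-level same-letter rate, and the Jensen-Shannon divergence between two position distributions. All helper names are hypothetical, and the base-2 logarithm is an assumption: the abstract does not state which base or implementation was used.

```python
from math import log2


def rotate_options(options, shift):
    """Cyclically move each option's content `shift` positions to the right."""
    k = shift % len(options)
    return list(options[-k:]) + list(options[:-k]) if k else list(options)


def rotated_answer_index(answer_index, shift, n_options):
    """Index of the correct answer after the same cyclic rotation."""
    return (answer_index + shift) % n_options


def same_letter_rate(responses_a, responses_b):
    """Fraction of items answered with the same letter under both orderings."""
    if len(responses_a) != len(responses_b) or len(responses_a) == 0:
        raise ValueError("need two non-empty response lists of equal length")
    matches = sum(a == b for a, b in zip(responses_a, responses_b))
    return matches / len(responses_a)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base 2 is an assumption here)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support). Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of this quantity, so values from the two are not directly comparable.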
