Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
Jon-Paul Cacioli

TL;DR
This study tests how the complexity of an instruction to underperform determines whether small instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) engage with question content or fall back on positional shortcuts during adversarial evaluation. It finds three distinct response regimes rather than a smooth transition.
Contribution
It maps the boundary conditions under which increasing instruction complexity shifts models from content-aware responding to position-based shortcuts, with multi-step instructions producing the strongest shift.
Findings
Vague adversarial instructions produce a moderate accuracy reduction while preserving content engagement.
Standard sandbagging and capability-imitation instructions collapse response-position entropy while retaining partial content engagement.
A two-step, answer-aware avoidance instruction causes extreme positional collapse, with responses concentrated on a single position.
Abstract
When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) on 2,000 MMLU-Pro items. Distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) jointly characterise each condition. The gradient reveals three regimes rather than a monotonic transition. Vague adversarial instructions produce moderate accuracy reduction with preserved content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme…
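The two screening measures named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the normalisation of entropy by log2 of the option count, the use of a Pearson correlation, the choice of baseline per-item accuracy as the difficulty proxy, the default of ten options, and the function names `position_entropy` and `pearson` are all assumptions made for illustration.

```python
import math
from collections import Counter


def position_entropy(responses, n_options=10):
    """Normalised Shannon entropy of the chosen answer positions.

    Returns 0.0 when every response picks the same position and
    approaches 1.0 when positions are chosen uniformly.
    n_options=10 is an assumed default, not taken from the paper.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    total = len(responses)
    counts = Counter(responses)
    # Sum of p * log2(1/p) over the positions that were actually chosen.
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    return h / math.log2(n_options)


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


# Hypothetical toy data: chosen positions under an adversarial condition,
# an assumed per-item difficulty proxy (baseline accuracy), and correctness.
collapsed = ["A"] * 20
uniform = list("ABCDEFGHIJ")
baseline_accuracy = [0.9, 0.8, 0.3, 0.2]
correct_under_condition = [1, 1, 0, 0]

entropy_collapsed = position_entropy(collapsed)
entropy_uniform = position_entropy(uniform)
engagement = pearson(baseline_accuracy, correct_under_condition)
```

Under these assumptions, a condition with low position entropy and a near-zero difficulty-accuracy correlation would look like pure positional responding, while low entropy together with a clearly positive correlation would correspond to the "partial content engagement" regime described above.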
