Steerability of Instrumental-Convergence Tendencies in LLMs
Jakub Hoscilowicz

TL;DR
This paper investigates how the ability to steer AI systems toward desired behaviors is affected by capability growth, highlighting a safety-security trade-off and demonstrating that anti-instrumental prompts can significantly reduce instrumental convergence in large language models.
Contribution
It introduces the concept of instrumental convergence steerability, distinguishes safety and security aspects, and empirically shows how anti-instrumental prompts can mitigate convergence tendencies in LLMs.
Findings
Anti-instrumental prompts sharply reduce convergence rates.
Larger models show lower convergence under anti-instrumental prompts.
Steerability trade-offs pose safety and security challenges.
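To make "convergence rate" concrete, here is a minimal sketch, assuming each model response has already been judged as instrumentally convergent or not. The function names, the boolean label format, and the example labels are hypothetical placeholders for illustration. They are not taken from the paper or from InstrumentalEval, and the numbers are not results.

```python
from typing import Sequence


def convergence_rate(judgments: Sequence[bool]) -> float:
    """Fraction of responses flagged as instrumentally convergent."""
    if not judgments:
        raise ValueError("need at least one judged response")
    return sum(judgments) / len(judgments)


def rate_reduction(baseline: Sequence[bool], steered: Sequence[bool]) -> float:
    """Drop in convergence rate when moving from a baseline prompt to a steering prompt."""
    return convergence_rate(baseline) - convergence_rate(steered)


# Made-up placeholder labels (True = response judged convergent); not real data.
baseline_labels = [True, True, False, True]
anti_instrumental_labels = [False, False, False, True]

print(convergence_rate(baseline_labels))
print(convergence_rate(anti_instrumental_labels))
print(rate_reduction(baseline_labels, anti_instrumental_labels))
```

A larger reduction under an anti-instrumental prompt would indicate that the prompt steers the model away from convergent behavior more strongly under this simplified measure.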
Abstract
We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety-security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability to prevent malicious actors from eliciting harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
