Steerability of Instrumental-Convergence Tendencies in LLMs

Jakub Hoscilowicz

arXiv:2601.01584 · cs.CL · January 7, 2026

TL;DR

This paper investigates how the ability to steer AI systems toward desired behaviors is affected by capability growth, highlighting a safety-security trade-off and demonstrating that anti-instrumental prompts can significantly reduce instrumental convergence in large language models.

Contribution

The paper introduces the concept of instrumental-convergence steerability, distinguishes its safety and security aspects, and shows empirically that anti-instrumental prompts can mitigate convergence tendencies in LLMs.

Findings

01

Anti-instrumental prompts sharply reduce convergence rates.

02

Larger models show lower convergence under anti-instrumental prompts.

03

Steerability trade-offs pose safety and security challenges.
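
The convergence-rate comparisons in the findings above can be made concrete with a minimal sketch. It assumes each model response has already been labeled as instrumentally convergent or not, and it compares the rate under two prompt conditions. The condition names (`baseline`, `anti_instrumental`), the helper functions, the "gap" quantity, and the toy labels are all hypothetical illustrations; they are not the paper's data, metric definitions, or the InstrumentalEval API.

```python
from collections import defaultdict


def convergence_rates(records):
    """Return the fraction of convergent responses per prompt condition.

    `records` is an iterable of (condition, is_convergent) pairs, where
    `is_convergent` is a boolean label assumed to come from some judge.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [convergent, total]
    for condition, is_convergent in records:
        counts[condition][0] += int(is_convergent)
        counts[condition][1] += 1
    return {c: conv / total for c, (conv, total) in counts.items()}


def rate_gap(rates, reference="baseline", steered="anti_instrumental"):
    """Difference in convergence rate between two conditions (illustrative only)."""
    return rates[reference] - rates[steered]


# Made-up toy labels, purely to exercise the functions above.
toy_records = [
    ("baseline", True), ("baseline", True),
    ("baseline", False), ("baseline", True),
    ("anti_instrumental", False), ("anti_instrumental", False),
    ("anti_instrumental", True), ("anti_instrumental", False),
]

rates = convergence_rates(toy_records)
gap = rate_gap(rates)
print(rates, gap)
```

A larger gap in this toy setup would mean the steering prompt moved behavior further from the reference condition; how the paper itself quantifies steerability is not specified in the text available here.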

Abstract

We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety-security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability so that malicious actors cannot elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques