Mitigating loss of control in advanced AI systems through instrumental goal trajectories
Willem Fourie

TL;DR
This paper introduces instrumental goal trajectories (IGTs), an organizational approach to mitigating loss of control over AI systems by monitoring the pathways through which they gain access to resources. IGTs complement existing technical safety measures.
Contribution
It proposes IGTs as a framework for monitoring and intervening in AI capability development through organizational artefacts, expanding safety strategies beyond technical solutions.
Findings
IGTs provide concrete intervention points for AI safety.
Monitoring resource access pathways helps control AI capabilities.
Organizational artefacts can signal potential safety issues.
Abstract
Researchers at artificial intelligence labs and universities are concerned that highly capable artificial intelligence (AI) systems may erode human control by pursuing instrumental goals. Existing mitigations remain largely technical and system-centric: tracking capability in advanced systems, shaping behaviour through methods such as reinforcement learning from human feedback, and designing systems to be corrigible and interruptible. Here we develop instrumental goal trajectories to expand these options beyond the model. Gaining capability typically depends on access to additional technical resources, such as compute, storage, data and adjacent services, which in turn requires access to monetary resources. In organisations, these resources can be obtained through three organisational pathways. We label these pathways the procurement, governance and finance instrumental goal…
Taxonomy
Topics: Ethics and Social Impacts of AI · Human-Automation Interaction and Safety · Embodied and Extended Cognition
