Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models
Ram Potham (Independent Researcher), Max Harms (Machine Intelligence Research Institute)

TL;DR
This paper proposes Corrigibility as a Singular Target (CAST), an approach to aligning foundation models by making them inherently controllable and responsive to human guidance, with the aim of preventing existential risks from misaligned AI.
Contribution
It introduces a paradigm shift from static value-loading to dynamic human empowerment in foundation models, with a comprehensive empirical research agenda for implementation.
Findings
The research agenda covers training methods including RLAIF, SFT, and synthetic data generation.
Scalability tests across different model sizes are proposed.
Controlled instructability demonstrations show increased responsiveness to human guidance.
Abstract
Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe. Current alignment approaches struggle with value specification complexity and fail to address emergent power-seeking behaviors. We propose "Corrigibility as a Singular Target" (CAST): designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them. This paradigm shift from static value-loading to dynamic human empowerment transforms instrumental drives: self-preservation serves only to maintain the principal's control; goal modification becomes facilitating principal guidance. We present a comprehensive empirical research agenda spanning training methodologies (RLAIF, SFT, synthetic data generation), scalability testing…
Taxonomy
Topics: Reservoir Engineering and Simulation Methods · Geological Modeling and Analysis
Methods: Shrink and Fine-Tune
