Evaluating Stability of Unreflective Alignment
James Lucassen, Mark Henry, Philippa Wright, Owen Yeung

TL;DR
This paper proposes Counterfactual Priority Change (CPC) destabilization as a mechanism by which future large language models could develop reflective stability problems, and reports preliminary evidence that its risk factors increase with model scale and capability.
Contribution
The paper introduces Counterfactual Priority Change (CPC) destabilization as a new mechanism for reflective stability problems in LLMs and provides preliminary evaluations of its two risk factors: CPC-based stepping back and preference instability.
Findings
In current frontier LLMs, increased scale and capability are associated with higher measured levels of the risk factors for CPC-destabilization.
CPC-destabilization may cause stability issues in future LLMs.
Preliminary evaluations suggest emerging risks with larger models.
Abstract
Many theoretical obstacles to AI alignment are consequences of reflective stability: the problem of designing alignment mechanisms that the AI would not disable if given the option. However, problems stemming from reflective stability are not obviously present in current LLMs, leading to disagreement over whether they will need to be solved to enable safe delegation of cognitive labor. In this paper, we propose Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future LLMs. We describe two risk factors for CPC-destabilization: 1) CPC-based stepping back and 2) preference instability. We develop preliminary evaluations for each of these risk factors, and apply them to frontier LLMs. Our findings indicate that in current LLMs, increased scale and capability are associated with increases in both CPC-based stepping back…
