Evaluating Stability of Unreflective Alignment

James Lucassen; Mark Henry; Philippa Wright; Owen Yeung

arXiv:2408.15116·cs.AI·August 28, 2024

TL;DR

This paper proposes Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future large language models, and reports preliminary evidence that the associated risk factors increase with scale and capability in current LLMs.

Contribution

It introduces Counterfactual Priority Change destabilization as a new mechanism for reflective stability problems in LLMs and provides preliminary evaluations of associated risk factors.

Findings

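01. In current LLMs, increased scale and capability are associated with higher levels of the evaluated risk factors.

02. CPC-destabilization may cause reflective stability problems in future LLMs.

03. The evaluations are preliminary, but they suggest these risks emerge as models grow larger.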

Abstract

Many theoretical obstacles to AI alignment are consequences of reflective stability - the problem of designing alignment mechanisms that the AI would not disable if given the option. However, problems stemming from reflective stability are not obviously present in current LLMs, leading to disagreement over whether they will need to be solved to enable safe delegation of cognitive labor. In this paper, we propose Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future LLMs. We describe two risk factors for CPC-destabilization: 1) CPC-based stepping back and 2) preference instability. We develop preliminary evaluations for each of these risk factors, and apply them to frontier LLMs. Our findings indicate that in current LLMs, increased scale and capability are associated with increases in both CPC-based stepping back…

Taxonomy

Topics: Image Processing and 3D Reconstruction