A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang

TL;DR
This paper provides a theoretical analysis of how large language models can self-correct responses through in-context learning, highlighting key transformer components and validating findings with synthetic data.
Contribution
It offers a theoretical framework explaining the emergence of self-correction in LLMs and identifies the roles of transformer design elements in this process.
Findings
Self-correction improves response quality when LLMs give relatively accurate self-examinations, which serve as rewards.
Key transformer components like softmax attention and multi-head attention facilitate self-correction.
Self-correction can be applied to defend against LLM jailbreaks.
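To make the role of softmax attention in the findings above concrete, here is a minimal sketch. It is not the paper's construction. It assumes a toy setting in which earlier responses are stored in context as small vectors, each paired with a self-assigned reward, and a single attention-style readout weights them by softmax of the rewards. The names `reward_weighted_readout`, `responses`, `rewards`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_readout(values, rewards, temperature=1.0):
    """Attention-style readout with weights softmax(rewards / temperature)."""
    weights = softmax(rewards, temperature)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical in-context history: three earlier responses as toy 2-d vectors,
# each with the reward the model assigned to it (assumed values).
responses = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rewards = [0.1, 0.9, 0.4]

# Moderate temperature: a blend of all responses, tilted toward high reward.
soft = reward_weighted_readout(responses, rewards, temperature=1.0)

# Low temperature: the readout collapses onto the highest-reward response.
sharp = reward_weighted_readout(responses, rewards, temperature=0.01)
```

In this toy setting, lowering the temperature turns the softmax into a near-hard selection of the best-rewarded response. That kind of ranking-like behavior is harder to obtain from a purely linear attention readout.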
Abstract
Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP…
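The abstract's condition that self-examinations be "relatively accurate" can be illustrated with a small Monte Carlo sketch. This is a toy model under stated assumptions, not the paper's setup or experiments. Each candidate response has a hidden quality drawn uniformly from [0, 1], the critic's reward is that quality plus Gaussian noise, and one round of correction keeps the candidate with the highest reward. The functions and parameters (`self_correct_once`, `mean_selected_quality`, `critic_noise`) are hypothetical.

```python
import random


def self_correct_once(qualities, critic_noise, rng):
    """Return the true quality of the candidate with the highest noisy reward."""
    rewards = [q + rng.gauss(0.0, critic_noise) for q in qualities]
    best = max(range(len(qualities)), key=lambda i: rewards[i])
    return qualities[best]


def mean_selected_quality(critic_noise, n_trials=2000, n_candidates=4, seed=0):
    """Average true quality of the selected response over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Hidden qualities are an assumption of this toy model.
        qualities = [rng.random() for _ in range(n_candidates)]
        total += self_correct_once(qualities, critic_noise, rng)
    return total / n_trials


# An accurate critic (small noise) versus an unreliable one (large noise).
accurate = mean_selected_quality(critic_noise=0.05)
noisy = mean_selected_quality(critic_noise=5.0)
```

Under these assumptions, the accurate critic selects clearly better responses on average than the noisy one, whose choices are close to random picks. This matches the qualitative claim that refinement depends on the quality of the self-examination.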
Taxonomy
Topics: Multimedia Communication and Technology
Methods: Softmax
