A Theoretical Understanding of Self-Correction through In-context Alignment

Yifei Wang; Yuyang Wu; Zeming Wei; Stefanie Jegelka; Yisen Wang

arXiv:2405.18634 · cs.LG · November 19, 2024 · 3 cites

TL;DR

This paper provides a theoretical analysis of how large language models can self-correct responses through in-context learning, highlighting key transformer components and validating findings with synthetic data.

Contribution

The paper offers a theoretical framework that explains how self-correction emerges in LLMs and identifies the roles that transformer design elements play in this process.

Findings

1. Self-correction improves response quality when the LLM's self-examinations are relatively accurate.

2. Key transformer components such as softmax attention and multi-head attention facilitate self-correction.

3. Self-correction can be applied to defend against LLM jailbreaks.

Abstract

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP…
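
The idea of refining responses in context, using self-examinations as rewards, can be illustrated with a small toy loop. The sketch below is not the paper's construction: the scalar "responses", the `quality` function, the noisy critic, and all parameter names are assumptions introduced only for illustration. It keeps a context of (response, reward) pairs and forms the next response as a softmax-weighted average of past responses, with weights driven by the self-assigned rewards.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def quality(response, target):
    """Hypothetical ground-truth quality: closer to the target is better."""
    return -abs(response - target)


def refine(context, temperature=0.1):
    """Reward-weighted average of past responses (softmax over rewards)."""
    responses = [resp for resp, _ in context]
    rewards = [rew for _, rew in context]
    weights = softmax(rewards, temperature)
    return sum(w * r for w, r in zip(weights, responses))


def self_correct(target, critic_noise, rounds=8, seed=0):
    """Run a toy self-correction loop; return (final_response, context)."""
    rng = random.Random(seed)
    response = rng.uniform(-5.0, 5.0)  # initial, possibly poor, response
    context = []
    for _ in range(rounds):
        # Self-examination: a noisy estimate of the true quality (assumed critic).
        reward = quality(response, target) + rng.gauss(0.0, critic_noise)
        context.append((response, reward))
        # Next response: refine from the context, plus a little exploration.
        response = refine(context, temperature=0.1) + rng.gauss(0.0, 1.0)
    return refine(context), context


if __name__ == "__main__":
    for noise in (0.0, 5.0):
        final, _ = self_correct(target=1.0, critic_noise=noise)
        print(f"critic_noise={noise}: final error = {abs(final - 1.0):.3f}")
```

Running the loop with a small and a large `critic_noise` is one way to probe, in this toy setting, how much the refinement depends on the accuracy of the self-examination. A low softmax temperature makes the weighting concentrate on the highest-reward response in the context.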

Taxonomy

Topics: Multimedia Communication and Technology

Methods: Softmax