Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong

TL;DR
This survey comprehensively reviews methods for aligning large language models with human values, discussing techniques, challenges, benchmarks, and future research directions to ensure safer and more reliable AI systems.
Contribution
It categorizes existing alignment methods into outer and inner alignment, and explores interpretability, vulnerabilities, and evaluation benchmarks for LLMs.
Findings
Overview of alignment techniques and their categorization
Discussion of interpretability and adversarial vulnerabilities
Summary of benchmarks and evaluation methodologies
Abstract
Recent years have witnessed remarkable progress in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various concerns. The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure that these models exhibit behaviors consistent with human values. This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues including the models' interpretability, and potential vulnerabilities to…
Videos
Manhattan Project for AI Safety [Connor Leahy] · YouTube
Taxonomy
Topics: Software Engineering Research · Topic Modeling
