AI Alignment: A Comprehensive Survey

Jiaming Ji; Tianyi Qiu; Boyuan Chen; Borong Zhang; Hantao Lou; Kaile Wang; Yawen Duan; Zhonghao He; Lukas Vierling; Donghai Hong; Jiayi Zhou; Zhaowei Zhang; Fanzhi Zeng; Juntao Dai; Xuehai Pan; Kwan Yee Ng; Aidan O'Gara; Hua Xu; Brian Tse; Jie Fu; Stephen McAleer; Yaodong Yang; Yizhou Wang; Song-Chun Zhu; Yike Guo; Wen Gao

arXiv:2310.19852·cs.AI·April 7, 2025·67 cites

Open Access

TL;DR

This survey provides a comprehensive overview of AI alignment, discussing core principles, research landscape, and techniques for ensuring AI systems behave in line with human values, emphasizing robustness, interpretability, controllability, and ethicality.

Contribution

It introduces a structured framework for AI alignment based on four principles and decomposes the field into forward and backward alignment, summarizing current research and practices.

Findings

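01. Identifies four key principles as the objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE).

02. Outlines techniques for learning from feedback and for governance.

03. Provides a curated resource website for ongoing updates.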

Abstract

AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment. The former aims to make AI systems aligned via alignment training, while the latter aims to gain evidence about the systems' alignment and govern them appropriately to avoid exacerbating misalignment risks. On forward alignment, we discuss techniques for learning…

Taxonomy

Topics: Software Engineering Research