Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges

Haoran Lu; Luyang Fang; Ruidong Zhang; Xinliang Li; Jiazhang Cai; Huimin Cheng; Lin Tang; Ziyu Liu; Zeliang Sun; Tao Wang; Yingchuan Zhang; Arif Hassan Zidan; Jinwen Xu; Jincheng Yu; Meizhi Yu; Hanqi Jiang; Xilin Gong; Weidi Luo; Bolun Sun; Yongkai Chen; Terry Ma; Shushan Wu; Yifan Zhou; Junhao Chen; Haotian Xiang; Jing Zhang; Afrar Jahin; Wei Ruan; Ke Deng; Yi Pan; Peilong Wang; Jiahui Li; Zhengliang Liu; Lu Zhang; Lin Zhao; Wei Liu; Dajiang Zhu; Xin Xing; Fei Dou; Wei Zhang; Chao Huang; Rongjie Liu; Mengrui Zhang; Yiwen Liu; Xiaoxiao Sun; Qin Lu; Zhen Xiang; Wenxuan Zhong; Tianming Liu; Ping Ma

arXiv:2507.19672 · cs.AI · July 29, 2025

TL;DR

This survey reviews the current landscape of large language model alignment, covering the techniques, challenges, and evaluation methods used to keep models aligned with human values and intentions.

Contribution

It provides a comprehensive overview of alignment methods, paradigms, and evaluation frameworks, highlighting recent advances and open challenges in LLM alignment.

Findings

01. Supervised fine-tuning enables basic instruction-following.
02. Preference-based methods offer nuanced alignment with human intent.
03. Current evaluation frameworks face limitations like reward misspecification.

Abstract

Large language models (LLMs) have been deeply integrated into many aspects of society because of their remarkable capabilities and growing impact, so ensuring their alignment with human values and intentions has emerged as a critical challenge. This survey provides a comprehensive overview of practical alignment techniques, training protocols, and empirical findings in LLM alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. Our analysis shows that while supervised fine-tuning enables basic instruction-following, preference-based methods offer more flexibility for aligning with nuanced human intent. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification…
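For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective, not code taken from the survey. It assumes each response has already been scored with a summed sequence log-probability under both the policy being trained and a frozen reference model. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, names are assumptions).

    Each argument is the summed log-probability of a full response
    (chosen = preferred, rejected = dispreferred) under the policy or
    the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins; beta controls how strongly
    # the policy is kept close to the reference model.
    z = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and reference agree on both responses, the loss equals log 2. It falls below that as the policy shifts probability toward the preferred response relative to the reference, and rises above it when the shift goes the other way.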
