PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

Zekun Moore Wang; Shawn Wang; Kang Zhu; Jiaheng Liu; Ke Xu; Jie Fu; Wangchunshu Zhou; Wenhao Huang

arXiv:2410.13785 · cs.CL · October 18, 2024

TL;DR

PopAlign introduces diversified contrasting patterns at multiple levels to improve large language model alignment, addressing limitations of traditional methods and enhancing robustness against jailbreaking.

Contribution

The paper proposes PopAlign, a framework that diversifies contrasting patterns without requiring extra feedback labeling, with the aim of making alignment more comprehensive and more robust.

Findings

01 PopAlign outperforms existing methods in alignment quality.

02 Diversified contrasting patterns enhance model robustness.

03 The framework does not require additional feedback labeling.

Abstract

Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do…
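
To make the idea of a "contrasting pattern" concrete, the following is a minimal sketch of how preference-contrastive pairs might be assembled at the prompt level and the model level. It is not the authors' implementation: the names `PreferencePair`, `prompt_level_pair`, `model_level_pair`, `build_pairs`, the two instruction prefixes, and the toy generators are all hypothetical, and the pipeline level mentioned in the abstract is not covered. The preference label comes from how each pair is constructed, so no extra feedback labeling step appears in the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

# A generator maps an input text to a response. In practice this would wrap
# an LLM call; here it is an abstract callable (assumption for illustration).
Generator = Callable[[str], str]


@dataclass
class PreferencePair:
    prompt: str    # the original user prompt, without any contrast prefix
    chosen: str    # response treated as preferred
    rejected: str  # response treated as dispreferred
    pattern: str   # which contrasting pattern produced the pair


# Hypothetical contrast prefixes; the paper's actual prompts are not shown here.
GOOD_PREFIX = "Respond helpfully, accurately, and safely.\n"
BAD_PREFIX = "Respond carelessly and unhelpfully.\n"


def prompt_level_pair(prompt: str, generate: Generator) -> PreferencePair:
    """Same model, contrasting instructions: the prompt carries the contrast."""
    chosen = generate(GOOD_PREFIX + prompt)
    rejected = generate(BAD_PREFIX + prompt)
    return PreferencePair(prompt, chosen, rejected, "prompt-level")


def model_level_pair(prompt: str, strong: Generator, weak: Generator) -> PreferencePair:
    """Same prompt, contrasting models: the model variant carries the contrast."""
    return PreferencePair(prompt, strong(prompt), weak(prompt), "model-level")


def build_pairs(prompts: List[str], strong: Generator, weak: Generator) -> List[PreferencePair]:
    """Mix several contrasting patterns instead of relying on a single one."""
    pairs: List[PreferencePair] = []
    for p in prompts:
        pairs.append(prompt_level_pair(p, strong))
        pairs.append(model_level_pair(p, strong, weak))
    return pairs


# Toy stand-ins for real models, used only so the sketch runs end to end.
def toy_strong(text: str) -> str:
    return "[strong] " + text.strip()


def toy_weak(text: str) -> str:
    return "[weak] " + text.strip()


if __name__ == "__main__":
    for pair in build_pairs(["How do I reset my password?"], toy_strong, toy_weak):
        print(pair.pattern, "|", pair.chosen, "|", pair.rejected)
```

The resulting pairs could then be passed to any preference-optimization trainer; which trainer the paper uses is not stated in the text above.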

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Recommender Systems and Techniques

Methods: Reinforcement Learning from AI Feedback