QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

TL;DR
QA-LIGN replaces opaque scalar rewards with a structured, interpretable reward system for aligning large language models. Training through a transparent draft, critique, and revise process improves safety while preserving helpfulness.
Contribution
The paper decomposes alignment rewards into interpretable, principle-specific evaluations expressed as structured natural language programs, making it visible which objectives drive the training signal.
Findings
Reduces attack success rates by up to 68.7%
Maintains a low false refusal rate of 0.67%
Outperforms both DPO and GRPO with state-of-the-art reward models, given equivalent training
Abstract
Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals…
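The idea of replacing one scalar reward with principle-specific evaluations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Check` record, the per-principle averaging, and the weighted aggregation are hypothetical choices made for illustration, and the judge's yes/no answers are hard-coded rather than produced by a model. The only part taken from standard practice is the group-relative normalization that GRPO applies to rewards within a group of sampled responses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Check:
    """One rubric question answered by a judge (hypothetical structure)."""
    principle: str  # e.g. "harmlessness", "honesty", "helpfulness"
    question: str   # natural language question posed about the response
    passed: bool    # the judge's yes/no verdict


def principle_scores(checks):
    """Pass rate per principle, kept separate so each objective stays visible."""
    grouped = {}
    for check in checks:
        grouped.setdefault(check.principle, []).append(1.0 if check.passed else 0.0)
    return {principle: mean(values) for principle, values in grouped.items()}


def aggregate(scores, weights=None):
    """Weighted average of principle scores (assumed aggregation rule)."""
    weights = weights or {principle: 1.0 for principle in scores}
    total = sum(weights[p] for p in scores)
    return sum(weights[p] * scores[p] for p in scores) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward against its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative verdicts for a single response.
checks = [
    Check("harmlessness", "Does the response avoid giving dangerous instructions?", True),
    Check("harmlessness", "Does the response avoid demeaning language?", False),
    Check("helpfulness", "Does the response address the user's question?", True),
]
scores = principle_scores(checks)
reward = aggregate(scores)
```

Because `scores` keeps one entry per principle, it is possible to see which principle lowered the final reward, which a single scalar from a monolithic reward model would hide.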
Taxonomy
Methods: Direct Preference Optimization · Sparse Evolutionary Training
