SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization
Yue Huang, Xiangqi Wang, Xiangliang Zhang

TL;DR
This paper introduces Self-Priority Alignment (SPA), a novel unsupervised framework that enforces a strict trustworthiness-before-helpfulness order in LLMs, improving safety and helpfulness in high-stakes scenarios.
Contribution
SPA is the first unsupervised method to implement priority alignment, generating and refining responses to ensure safety before helpfulness in LLMs.
Findings
SPA outperforms strong baselines in helpfulness and safety.
SPA maintains general capabilities while improving alignment.
The framework is scalable and interpretable.
Abstract
In high-stakes scenarios, such as self-harm, legal, or medical queries, LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict "trustworthy-before-helpful" ordering: optimization of helpfulness is conditioned on first meeting trustworthiness thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, self-evaluates and refines them, and applies dual-criterion denoising to remove inconsistency and control variance. From the denoised responses, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA…
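The "trustworthy-before-helpful" ordering can be illustrated with a small sketch. This is not the authors' implementation: the `Scored` record, the `trust_threshold` value, and the rule that responses below the threshold are ranked by trust alone are all assumptions made for illustration. The sketch only shows how a lexicographic key might turn self-evaluated scores into (chosen, rejected) preference pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Scored:
    """A candidate response with hypothetical self-evaluated scores in [0, 1]."""
    text: str
    trust: float    # assumed trustworthiness score (e.g. harmlessness)
    helpful: float  # assumed helpfulness score


def priority_key(r: Scored, trust_threshold: float) -> tuple:
    # Lexicographic key: passing the trust threshold always comes first.
    # Helpfulness is compared only among responses that pass it; responses
    # that fail are compared by trust alone (an illustrative assumption).
    if r.trust >= trust_threshold:
        return (1, r.helpful)
    return (0, r.trust)


def build_pairs(responses, trust_threshold=0.8):
    """Return (chosen, rejected) pairs ordered by the lexicographic key."""
    pairs = []
    for a, b in combinations(responses, 2):
        ka = priority_key(a, trust_threshold)
        kb = priority_key(b, trust_threshold)
        if ka == kb:
            continue  # no preference signal when the keys tie
        pairs.append((a, b) if ka > kb else (b, a))
    return pairs


candidates = [
    Scored("cautious answer", trust=0.90, helpful=0.50),
    Scored("cautious and detailed answer", trust=0.95, helpful=0.70),
    Scored("detailed but unsafe answer", trust=0.40, helpful=0.99),
]
pairs = build_pairs(candidates)
for chosen, rejected in pairs:
    print(f"{chosen.text!r} preferred over {rejected.text!r}")
```

Under these assumed scores, the unsafe answer is rejected in every pair it appears in despite having the highest helpfulness, which is the behavior a strict priority ordering is meant to produce. The uncertainty-weighted loss described in the abstract is not modeled here.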
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Semantic Web and Ontologies
