Is Free Self-Alignment Possible?
Dyah Adila, Changho Shin, Yijing Zhang, Frederic Sala

TL;DR
AlignEZ is a cost-effective method for aligning pretrained language models that relies only on the model's own capabilities: it combines self-generated preference data with representation editing, reducing reliance on large preference datasets and computational resources.
Contribution
This paper presents AlignEZ, a novel approach that achieves efficient, multi-objective alignment of language models without additional training or external preference data.
Findings
Improves alignment performance by up to 19.9% on general tasks
Improves mathematical reasoning accuracy by up to 1.9%
Enables multi-objective alignment and accelerates traditional methods
Abstract
Aligning pretrained language models (LMs) often requires large-scale preference data and substantial computational resources. These costs become even more prohibitive for multi-objective or pluralistic alignment. Is this truly necessary? Can we perform efficient alignment using only internal model capabilities, and without additional training? To answer this question, we propose AlignEZ, a novel approach that leverages (1) self-generated preference data and (2) representation editing to achieve cost-effective, efficient alignment. By operating directly on learned representations, AlignEZ independently targets different behavioral aspects without the overhead of traditional alignment methods. Our experiments reveal that this cost-efficient procedure improves performance across diverse tasks: up to 19.9% on general alignment and 1.9% on challenging mathematical reasoning tasks, even when…
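The abstract does not spell out how the representation editing is carried out, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a simple difference-of-means direction estimated from hidden states of preferred and dispreferred self-generated responses, and it uses random toy vectors in place of real model activations. The names `preference_direction`, `edit_representation`, and `strength` are hypothetical and do not come from the paper.

```python
import numpy as np


def preference_direction(preferred, dispreferred):
    """Unit vector from the dispreferred mean toward the preferred mean.

    Assumption: a difference-of-means estimate; the paper may use another estimator.
    """
    diff = preferred.mean(axis=0) - dispreferred.mean(axis=0)
    return diff / np.linalg.norm(diff)


def edit_representation(hidden, direction, strength=1.0):
    """Shift every hidden vector by `strength` along the unit `direction`."""
    return hidden + strength * direction


rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0

# Toy stand-ins for hidden states of self-generated responses (not real activations).
preferred = rng.normal(size=(32, dim)) + 2.0 * axis
dispreferred = rng.normal(size=(32, dim)) - 2.0 * axis

direction = preference_direction(preferred, dispreferred)

# Edit a small batch of hidden states at inference time; no weights are updated.
hidden = rng.normal(size=(4, dim))
edited = edit_representation(hidden, direction, strength=1.5)
```

In this sketch, a separate direction could be estimated for each behavioral aspect and applied independently, which is one plausible reading of how editing representations might support multi-objective alignment without training.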
Peer Reviews
Decision: Submitted to ICLR 2025
1. By eliminating reliance on costly fine-tuning and human-annotated preference data, AlignEZ offers a more resource-efficient alternative to traditional alignment techniques, making it potentially scalable for larger or real-time applications.
2. AlignEZ presents a conceptually straightforward approach, leveraging self-generated preference data and representation editing to achieve alignment without the need for extensive fine-tuning or additional ground-truth annotations.
3. Empirically, the…
1. While AlignEZ combines self-generated preference data and representation editing effectively, both of these approaches are well-established methods. Techniques for generating synthetic preference data have been extensively studied in self-alignment literature, which limits the technical novelty of this work. To strengthen its contribution, the authors could further emphasize unique aspects of their integration of these techniques or explore additional novel dimensions, such as adaptive mechan…
1. The paper is well-written and easy to follow.
2. The performance results presented, particularly in Table 1, demonstrate the effectiveness of AlignEZ. As an inference-time alignment technique, AlignEZ significantly improves the alignment recovery compared to other test-time alignment baselines.
3. The authors conduct additional experiments to confirm that their method is orthogonal to prompting techniques such as URIAL. This enhances the practicality and robustness of the proposed approach.
1. The comparison with the baselines appears somewhat unfair. While the baseline test-time alignment techniques utilize ground-truth preference signals, these signals are sourced from a different dataset (HH-RLHF), introducing a distribution shift compared to the data used in AlignEZ. To provide a more balanced comparison, I suggest evaluating AlignEZ using the same HH-RLHF data during inference.
2. Concerning practicality and efficiency, the authors do not provide details on the overhead induc…
1. Self-Generated Data: AlignEZ relies solely on self-generated data from the language model, eliminating the need for additional manual annotation, which is promising for scalability.
2. Performance Improvement: Experimental results show that this alignment approach effectively reduces the performance gap compared to traditional training-based methods, without requiring any additional training.
3. Data Efficiency: By using only 1% of preference data for DPO training, AlignEZ achieves results…
1. Unclear Methodology: The description of the method's details is vague, particularly regarding the origin of queries used for generating self-generated preference data. Additionally, it's unclear which dataset was used for the experiments shown in Figures 3 and 4, and whether the reported $\Delta\%$ represents an average score across multiple datasets or results from a single dataset.
2. Dependence on base LM: The base LM may generate incorrect preference data, and the paper does not clarify h…
Taxonomy
Topics: Corporate Governance and Law
Methods: Direct Preference Optimization · Balanced Selection · ALIGN
