Aligner: Efficient Alignment by Learning to Correct
Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu, Yaodong Yang

TL;DR
Aligner introduces a simple, model-agnostic alignment method that learns to correct responses, enabling rapid deployment and iterative improvement of large language models with significant performance gains across multiple metrics.
Contribution
The paper presents Aligner, a novel plug-and-play alignment approach that learns residual corrections, applicable to various models with one-off training, and capable of iterative bootstrapping to surpass existing performance limits.
Findings
Aligner improves helpfulness by 68.9% on average across 11 LLMs.
Aligner reduces hallucination and enhances harmlessness effectively.
Stacking Aligner on GPT-4 Turbo increases its win rate to 58.3%.
Abstract
With the rapid development of large language models (LLMs) and ever-evolving practical requirements, finding an efficient and effective alignment method has never been more critical. However, the tension between the complexity of current alignment methods and the need for rapid iteration in deployment scenarios necessitates the development of a model-agnostic alignment approach that can operate under these constraints. In this paper, we introduce Aligner, a novel and simple alignment paradigm that learns the correctional residuals between preferred and dispreferred answers using a small model. Designed as a model-agnostic, plug-and-play module, Aligner can be directly applied to various open-source and API-based models with only one-off training, making it suitable for rapid iteration. Notably, Aligner can be applied to any powerful, large-scale upstream models. Moreover, it can even…
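The plug-and-play idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `stack_aligner`, `toy_upstream`, and `toy_aligner` are hypothetical, and both models are replaced by toy string functions. The only assumption taken from the abstract is that a small corrector model sits after an upstream model and rewrites its answer, without touching the upstream model's weights.

```python
from typing import Callable

# Hypothetical type aliases: any text-in, text-out callable can play either role.
Upstream = Callable[[str], str]        # query -> draft answer
Corrector = Callable[[str, str], str]  # (query, draft answer) -> corrected answer


def stack_aligner(upstream: Upstream, aligner: Corrector) -> Upstream:
    """Wrap an upstream model so each draft answer passes through the corrector."""

    def aligned(query: str) -> str:
        draft = upstream(query)       # upstream model (open-source or API) is used as-is
        return aligner(query, draft)  # the corrector sees both the query and the draft

    return aligned


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_upstream(query: str) -> str:
    return f"Draft answer to: {query}"


def toy_aligner(query: str, draft: str) -> str:
    return draft.replace("Draft", "Corrected")


model = stack_aligner(toy_upstream, toy_aligner)
print(model("How do I reset my password?"))
```

Because the wrapper only needs a text-in, text-out callable, the same corrector could in principle be placed behind different upstream models, which is the sense in which the abstract calls the approach model-agnostic.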
Peer Reviews
Decision·NeurIPS 2024 oral
1. The writing is very clear, and Aligner is imbued with the concept of Residual Correction, reminiscent of ResNet. 2. Aligner has the potential to play a role in multi-round RLHF training.
1. The training of Aligner requires a robust Teacher model and human annotations. More importantly, as compared by the authors in their experiments, works like CAI have already demonstrated the model's ability to self-improve. The authors need to emphasize the fundamental difference between collecting improvement signals from a broad range of preferences and using the model for self-improvement, explaining why the former can yield stronger results than self-improvement alone. 2. The authors hav
1. The paper proposes an interesting method which is Aligner to improve LLM alignment. 2. The paper has developed a model which can correct the model answers. The paper has conducted extensive experiments to demonstrate the effectiveness of Aligner and also discussed potential use cases of Aligner.
1. Data annotation for Aligner should be much more expensive: Aligner needs the human annotator to correct the response, unlike collecting preference feedback, where the annotator only needs to judge which candidate answer is better. Collecting corrections should be much more expensive than collecting preference feedback. 2. LLMs can self-critique and self-correct their answers. For advanced language models such as GPT-4, do we need the aligner to help with the alignment?
originality: Seems quite original, as I haven't seen any paper that uses an aligner LLM to map corrected outputs, but it seems plausible that a less well-known paper was already using such an alignment technique (or that a big AI lab is using this internally), given it's just a novel combination of known techniques and training schemes. quality: Experiment and analysis are of good quality. Comprehensive baselines of existing SOTA techniques compared, decent variety of robustness tests, intriguing s
- For the RLHF section, it sounds like you're just transferring the OOD reward model collapse problem to the aligner module? It seems like you'd run into the same problem if the aligner LM performs poorly on OOD inputs and generates poor synthetic data to train on. I'm no expert on RLHF reward collapse, so correct me if I'm wrong. Other fairly minor weaknesses: - Concrete examples of the prompts and before/after alignment LM responses would possibly be helpful to have a clearer qualitative s
- The method is well-motivated and quite lightweight. As demonstrated in the experiments on multi-round RLHF, it can assist in iterative updating. - Extensive experiments are conducted on 11 different LLMs, various datasets, and both single-turn and multi-turn scenarios. - Interpretability experiments are performed, providing interesting insights.
The discussion regarding the out-of-domain (OOD) extent of training data to test data is not addressed. I am quite interested in how a trained Aligner performs on OOD datasets.
Taxonomy
Topics: Handwritten Text Recognition Techniques · Color Science and Applications
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention
