SALMON: Self-Alignment with Instructable Reward Models
Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

TL;DR
SALMON introduces a method to align large language models with minimal human supervision by using an instructable reward model trained on synthetic data, enabling controllable and scalable AI behavior without extensive human annotations.
Contribution
The paper presents SALMON, an approach that reduces reliance on human annotations by aligning LLMs with a reward model trained on synthetic preference data and conditioned on human-defined principles.
Findings
Dromedary-2, the assistant trained with SALMON, outperforms several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets.
Requires only 6 in-context learning exemplars and 31 human-defined principles.
Reduces need for online human preference collection.
Abstract
Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents. However, a significant limitation of such an approach is its dependency on high-quality human annotations, making its application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and in-distribution response preferences. This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision, using only a small set of human-defined principles, yet achieving superior performance. Central to our approach is an instructable reward model. Trained on synthetic preference data, this model can generate reward scores based on arbitrary human-defined principles. By merely adjusting these principles during the RL training…
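To make the idea of an instructable reward model concrete, the toy sketch below shows how the same pair of responses can be ranked differently once the principles change. It is an illustration under stated assumptions, not the paper's implementation: the `Principle` class, the prompt layout in `build_reward_prompt`, and the word-counting `toy_reward` are all hypothetical stand-ins for a learned, principle-conditioned reward model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Principle:
    name: str         # hypothetical short label, e.g. "Concise"
    description: str  # natural-language statement shown to the reward model


def build_reward_prompt(query: str, response: str, principles: List[Principle]) -> str:
    """Assumed input layout for a principle-conditioned reward model."""
    lines = ["Principles to follow:"]
    lines += [f"- {p.name}: {p.description}" for p in principles]
    lines += ["", f"User: {query}", f"Assistant: {response}"]
    return "\n".join(lines)


def toy_reward(response: str, principles: List[Principle]) -> float:
    """Stand-in scorer (NOT a trained model): rewards or penalizes length
    depending on which hypothetical principle is active."""
    n_words = len(response.split())
    score = 0.0
    for p in principles:
        if p.name == "Concise":
            score -= n_words  # shorter responses score higher
        elif p.name == "Detailed":
            score += n_words  # longer responses score higher
    return score


def prefer(a: str, b: str, principles: List[Principle],
           reward_fn: Callable[[str, List[Principle]], float] = toy_reward) -> str:
    """Return whichever response the reward function ranks higher."""
    return a if reward_fn(a, principles) >= reward_fn(b, principles) else b


concise = Principle("Concise", "Answer as briefly as possible.")
detailed = Principle("Detailed", "Give a thorough, complete answer.")

short_answer = "Paris."
long_answer = "The capital of France is Paris, a city on the Seine."

# Swapping the principle flips the preference without retraining anything.
print(prefer(short_answer, long_answer, [concise]))
print(prefer(short_answer, long_answer, [detailed]))
print(build_reward_prompt("What is the capital of France?", short_answer, [concise]))
```

In an actual system the scorer would be a trained model reading the full prompt; the point of the sketch is only that the principles are part of the reward model's input, so they can be adjusted at RL training time.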
Peer Reviews
Decision: ICLR 2024 poster
- The proposed method only changes how synthetic data is generated, without much modification to the subsequent RLHF procedure, which makes the technique more general and easy to adapt to other tasks.
- The method can be quite helpful when more domain-specific preference data is needed (e.g., code, agents) and no such data is publicly available.
- The authors demonstrate the advantage of the new method by fine-tuning with QLoRA on 70B models, demonstrating its abilit

- It would be good to show that the method can also improve the performance of smaller models such as 7B or 33B, making the method easier to apply to other topics or tasks.
- I believe this method could potentially be adapted to other tasks such as code generation, but I am not sure whether it is possible; it would be good if the authors could comment on this.

- The paper is generally well-written, though addressing some questions related to preference collection should improve the clarity further.
- A relevant and timely problem to address. Preference data needs to be extensively collected to keep reward models in-distribution with the current RL policy.
- The performance of the model is impressive, and the recipe for AI feedback seems quite interesting.

- Some lack of novelty compared to Constitutional AI: the paper emphasizes that Constitutional AI focuses more on safety, but the technique itself is very much amenable to building a more “helpful” constitution too. Still, the system laid out is distinct enough to warrant interest from the community.
- The paper claims that using principles avoids reward hacking. Perhaps the term “reward hacking” is a bit overloaded, but I don’t see any reason that SALMON rewards cannot be hacked to give degenera

- S1. First of all, this paper is well-written and well-organized.
- S2. It is very interesting that SALMON (an RLAIF method) can significantly reduce human annotation costs compared to a prevalent RLHF method.
- S3. Unlike other RLAIF methods, SALMON can control preference scores by using a principle-following reward model (i.e., changing a principle to follow).

- W1. One of the main contributions of this paper is a principle-following reward model that can control reward scores according to principles. In addition to the overall alignment scores, can the authors report quantitative results for the principle-following reward model itself?
- W2. Even though Llama-2-70B with SALMON achieves a better alignment score (7.4 on MT-Bench) than Llama-2-70B with RLHF (PPO) (6.9), there is still a large gap to GPT-4 (9.0) and ChatGPT (7.9).
- W3. This paper compares SAL
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Reinforcement Learning in Robotics
Methods: Balanced Selection · ALIGN
