Parameter-Efficient Tuning Helps Language Model Alignment
Tianci Xue, Ziqi Wang, Heng Ji

TL;DR
This paper introduces MEET, a parameter-efficient tuning method that enhances controllable generation in language models for better alignment with human preferences, addressing limitations of previous reinforcement learning and preference optimization methods.
Contribution
The paper proposes a novel approach combining parameter-efficient tuning with controllable generation to improve model alignment and flexibility over existing methods.
Findings
MEET outperforms prior methods on benchmark datasets.
Control tokens optimized with MEET lead to higher quality controllable outputs.
The approach enables learning multiple preferences simultaneously.
Abstract
Aligning large language models (LLMs) with human preferences is essential for safe and useful LLMs. Previous works mainly adopt reinforcement learning (RLHF) and direct preference optimization (DPO) with human feedback for alignment. Nevertheless, they have certain drawbacks. One such limitation is that they can only align models with one preference at training time (e.g., they cannot learn to generate concise responses when the preference data prefers detailed responses), or have certain constraints on the data format (e.g., DPO only supports pairwise preference data). To this end, prior works incorporate controllable generation for alignment so that language models learn multiple preferences and provide outputs with different preferences during inference if asked. Controllable generation also offers more flexibility with regard to data format (e.g., it supports pointwise…
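The data-format flexibility described in the abstract can be illustrated with a minimal sketch. It assumes the common controllable-generation recipe of prepending a control token to each training example; the token strings `<good>` and `<bad>`, the `Example` class, the helper names, and the score threshold are all hypothetical and are not taken from the paper (whose control tokens are learned rather than fixed strings).

```python
from dataclasses import dataclass

# Hypothetical control tokens used only for illustration.
GOOD, BAD = "<good>", "<bad>"


@dataclass(frozen=True)
class Example:
    control: str   # control token the model is conditioned on
    prompt: str    # user prompt
    response: str  # target response

    def text(self) -> str:
        # Training string: control token, then prompt, then response.
        return f"{self.control} {self.prompt}\n{self.response}"


def from_pairwise(prompt: str, chosen: str, rejected: str) -> list[Example]:
    # A preference pair becomes two conditioned examples, one per control token.
    return [Example(GOOD, prompt, chosen), Example(BAD, prompt, rejected)]


def from_pointwise(prompt: str, response: str, score: float,
                   threshold: float = 0.5) -> list[Example]:
    # A single scored response is mapped to a control token by thresholding
    # (the threshold is an assumption of this sketch).
    control = GOOD if score >= threshold else BAD
    return [Example(control, prompt, response)]


examples = from_pairwise("Summarize the article.",
                         "A short, faithful summary.",
                         "An off-topic reply.")
examples += from_pointwise("Explain LoRA.", "LoRA adds low-rank updates.", score=0.9)
```

Under these assumptions, both data formats reduce to the same conditioned-likelihood training examples, and at inference time the desired preference is selected by choosing which control token to prepend.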
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The evaluation is thorough for the benchmarks considered, with 2 different evaluation metrics and ablation studies.
2. The problem of controllable generation is important, allowing one to control model generation at inference time.
1. One key ablation that is missing is doing stage 2 only (skipping the control token optimization) but starting with the CoH control tokens (or not even optimizing the CoH tokens at all). This would really elucidate the role of prefix optimization since if it is subsumed by CoH, it is not important that it is parameter-efficient (which is a central claim to the paper).
2. Training soft prompts before fine-tuning the model has been studied by the prior work of [promot](https://arxiv.org/abs/22
1. This paper studies a parameter-efficient way to improve language model alignment. It is an interesting direction to explore.
2. It studies several aspects of the proposed method such as prompt length, rank, and temperature.
1. This paper conducted several experiments. However, I don't think the baselines the paper compares with are sufficient. Several works focus on a similar idea of incorporating the reward into text learning, such as RLPrompt [1] and AutoPrompt [2]. Those should be included as baselines for the proposed method. Also, for controllable text generation, there is an interesting direction that utilizes the diffusion process, such as Diffusion-LM [3]. However, none of these are incl
- The paper is well-written, the details of the experimental setting are clear.
- The two-stage training procedure is interesting and its importance is validated by the ablation study.
- Results seem to suggest that the two-step optimization method delivers gains w.r.t. DPO.
- It feels like the authors are a bit confused on where the novelty of their paper really lies, they seem to suggest that it is in using adapters to control generation, but imho, the interesting bit is more on the two-step training procedure that guarantees information is captured by the adapters and thus they are not "information-starved" by the full LM fine-tuning (easy to fix)
- The more problematic bit is that authors' confusion seems to have affected the overall experimental methodology; f
* Results against public datasets are presented, with comparisons against some existing alignment baselines, notably DPO.
* Ablation study showing impact of including either just the soft-prompt learning, or just the further LLM fine-tuning (given a fixed soft/static prompt).
* The paper blends topics, without good justification. It is focussed on alignment, and presents a valid question on whether alignment can be achieved via control codes (static, or trained, i.e. soft-prompt learning). However, it introduces parameter efficiency, and IMHO I see no rationale for how this relates. The question of how to align an LLM can be addressed separately from requiring the learnt control codes to come from LoRA-adapted models or otherwise.
* Despite criticising some existing alig
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Speech and dialogue systems
Methods: Direct Preference Optimization · ALIGN
