Negative-Prompt-driven Alignment for Generative Language Model
Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng

TL;DR
This paper introduces NEAT, a novel alignment method for language models that uses negative prompts to explicitly discourage undesirable outputs, improving alignment with human values.
Contribution
NEAT is the first approach to incorporate negative prompts explicitly in the alignment process, guiding models away from harmful responses during training.
Findings
NEAT significantly reduces harmful and biased outputs.
Models trained with NEAT better align with human preferences.
NEAT improves safety and ethical compliance in language generation.
Abstract
Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, widely used alignment datasets contain few explicit negative examples that contradict human values, which limits how well training can discourage harmful or biased outputs. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, which uses negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating…
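The abstract describes rewarding desirable responses while penalizing undesirable ones, but the excerpt does not give the actual objective. The sketch below is therefore not NEAT's loss. It swaps in a generic unlikelihood-style penalty to illustrate the general idea of combining a positive term with a negative term. The function names, the `penalty_weight` parameter, and the per-token probabilities are all hypothetical.

```python
import math


def positive_nll(token_probs):
    """Mean negative log-likelihood of the tokens in a desirable response."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def negative_penalty(token_probs):
    """Mean unlikelihood-style penalty on the tokens of an undesirable response.

    The penalty grows as the model assigns more probability to those tokens.
    This is an illustrative stand-in, not the penalty used in the paper.
    """
    return -sum(math.log(1.0 - p) for p in token_probs) / len(token_probs)


def combined_loss(pos_probs, neg_probs, penalty_weight=0.5):
    """Positive likelihood term plus a weighted penalty on the negative response."""
    return positive_nll(pos_probs) + penalty_weight * negative_penalty(neg_probs)


# Hypothetical per-token probabilities the model assigns to each response.
pos_probs = [0.9, 0.8, 0.85]          # desirable response
neg_probs_before = [0.6, 0.7, 0.5]    # undesirable response, still likely
neg_probs_after = [0.1, 0.2, 0.1]     # undesirable response, now unlikely

loss_before = combined_loss(pos_probs, neg_probs_before)
loss_after = combined_loss(pos_probs, neg_probs_after)
```

With the positive response held fixed, the loss falls as the model assigns less probability to the undesirable response, which is the "steering away" behavior the abstract describes.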
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This paper is easy to follow, though the writing needs improvement. Some content is not well arranged; for example, the NEAT-PP variant is not well explained. 2. The proposed method is easy to implement, and the results show that it can help the LLM align better.
1. Lack of novelty. The claimed "focusing on penalizing negative responses" has already been explored [1]. The claimed "online sampling" is not really "online"; it has two stages, sampling and training, which is similar to many alignment methods [2,3,4]. The combined training loss contributes little to this field. 2. Experimental results are very limited. This paper uses only one dataset and only two metrics: PPL and reward score. Please refer to existing papers to conduct
The paper writing is fluent.
1. There is a highly relevant work: "Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization". That paper also uses negative responses to guide LLMs away from generating harmful responses while maintaining helpfulness. The idea is highly similar to this paper's, but appeared about one year earlier. 2. The experiments conducted in this paper are limited, which makes the effectiveness of the proposed approach unconvincing.
The paper presentation is clear and effective, with a compelling problem statement that highlights gaps in existing alignment methods. NEAT is introduced through a well-explained, novel approach involving negative prompts and a dual feedback mechanism. The detailed methodology, structured experimental setup, and use of visual aids ensure clarity, making the proposed solution and findings accessible and easy to follow.
1. Limitation of Applied Scenario: The effectiveness of NEAT relies heavily on the quality of the reward function used for penalizing negative outputs, which can be challenging to train and optimize effectively. This dependency limits the robustness and general applicability of the approach. 2. Lack of Benchmark Dataset: All experiments in the paper are conducted solely on the HH-RLHF dataset. This limits the generalizability of the findings and raises concerns about whether NEAT can outperform
1. The problem of aligning LLMs with human preferences is important. 2. The authors provide figures and a pseudo-code to illustrate the proposed method.
My concerns are as follows. 1. The motivation of this paper is unclear. In Lines 13-15 of the Abstract, the authors claim that “existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors.” However, many preference alignment methods [1,2,3,4,5,6,7,8] use a ranking-based loss to guide the model away from undesirable behaviors. 2. This paper uses several important concepts in a confusing way. I a
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Focus · Neural Attention Fields
