Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF
Chen Zheng, Ke Sun, Hang Wu, Chenguang Xi, Xun Zhou

TL;DR
This paper improves conversational LLMs by applying Harmless RLHF directly to the base model, bypassing traditional supervised fine-tuning. The approach preserves the base model's general capabilities and reduces toxic outputs, as demonstrated on the Mistral model.
Contribution
The paper proposes a direct RLHF method that enhances the conversational abilities and safety of LLMs while avoiding the forgetting and capability loss associated with supervised fine-tuning. The method is applied to the Mistral model.
Findings
Mistral-Plus outperforms similar-sized open-source models.
Significant improvement in conversational abilities.
Reduced toxic output generation.
Abstract
In recent advancements in Conversational Large Language Models (LLMs), a concerning trend has emerged: many new base LLMs lose foundational capabilities following Supervised Fine-Tuning (SFT), whether through forgetting or a broader decrease in the base model's abilities. Moreover, fine-tuned models struggle to align with user preferences, inadvertently increasing the generation of toxic outputs when specifically prompted. To overcome these challenges, we adopted an innovative approach by completely bypassing SFT and directly implementing Harmless Reinforcement Learning from Human Feedback (RLHF). Our method not only preserves the base model's general capabilities but also significantly enhances its conversational abilities, while notably reducing the generation of toxic outputs. Our approach holds significant…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
Methods: Shrink and Fine-Tune · ALIGN · Balanced Selection
