OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, Yang Liu

TL;DR
OpenChat introduces a novel framework for fine-tuning open-source language models using mixed-quality data by leveraging class-conditioned reinforcement learning, achieving superior performance without costly preference labeling.
Contribution
The paper proposes C-RLFT, a class-conditioned reinforcement learning approach that effectively utilizes mixed-quality data for open-source language model fine-tuning, avoiding expensive human preference labels.
Findings
OpenChat-13b outperforms other 13b open-source models on the benchmarks evaluated.
The model generalizes well, surpassing the base model on AGIEval.
C-RLFT is lightweight and avoids costly human preference labeling.
Abstract
Nowadays, open-source large language models like LLaMA have emerged. Recent developments have incorporated supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to align these models with human goals. However, SFT methods treat all training data with mixed quality equally, while RLFT methods require high-quality pairwise or ranking-based preference data. In this study, we present a novel framework, named OpenChat, to advance open-source language models with mixed-quality data. Specifically, we consider the general SFT training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. Interestingly, the…
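The idea in the abstract, treating each data source as a coarse reward label and conditioning the policy on that class, can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the class names, the tag strings, the reward values, and the helper functions are all hypothetical, and the per-token probabilities stand in for the output of a real language model. It shows only the general shape of a class-conditioned, reward-weighted SFT loss.

```python
import math

# Hypothetical coarse rewards per data source (illustrative values only,
# not taken from the paper).
CLASS_REWARD = {"expert": 1.0, "suboptimal": 0.1}

# Hypothetical condition tags that mark which source an example came from.
CLASS_TAG = {"expert": "<expert>", "suboptimal": "<suboptimal>"}


def condition_prompt(prompt, source):
    """Prefix the prompt with its source tag so the policy is class-conditioned."""
    return f"{CLASS_TAG[source]} {prompt}"


def sequence_nll(token_probs):
    """Negative log-likelihood of a response from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def weighted_sft_loss(batch):
    """Mean over the batch of (class reward * sequence NLL).

    `batch` is a list of (source, token_probs) pairs, where token_probs are
    the probabilities the model assigned to the reference response tokens.
    """
    total = sum(CLASS_REWARD[src] * sequence_nll(probs) for src, probs in batch)
    return total / len(batch)
```

Under these assumptions, an expert example contributes its full likelihood loss while a sub-optimal example is down-weighted rather than discarded. At inference time one would presumably prompt with the expert tag. Dividing by the batch size rather than by the sum of the weights is a design choice of this sketch.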
Peer Reviews
Decision: ICLR 2024 poster
1. The proposed method is easy and efficient to train. 2. The method has many practical application scenarios where data quality cannot be controlled well, especially in language model training. 3. The experimental evaluation confirms the effectiveness of the proposed method.
1. It requires the data quality to be divisible into two sets: good quality and average quality, which might potentially limit its applications.
- Originality: The paper introduces a new method, C-RLFT, that leverages coarse-grained rewards and class-conditioned policies to align the language model with human goals. The paper also provides a theoretical analysis and derivation of the optimal policy for C-RLFT. The resulting framework is a conditioned SFT with weighted loss, which is easy to implement and more stable than RLHF. - Quality: The paper presents evaluation results on three benchmarks to assess instruction following ability of
- The paper claims the superiority of C-RLFT over RLFT. But the claim is only supported by openchat-13b, which distilled its knowledge from two RLFT models: GPT-4 and GPT-3.5. So, it is only fair to say C-RLFT is better than SFT when distilling from GPT models, based on the fact that openchat-13b is better than vicuna-13b-1.5. However, one cannot say C-RLFT is better than RLFT, because openchat-13b learns from models trained by RLFT. - It is not clear if the "low quality data" is even us
The proposed method emphasizes data quality, which is a critical aspect of the model alignment problem. The framework is straightforward and replicable, delivering significant improvements across various benchmark datasets. The evaluation is comprehensive, conducted against several established models and on a wide range of benchmark datasets, with a carefully considered experimental design.
The paper does not adequately address the challenge of estimating data quality when leveraging mixed-quality data sources. The method presented seems to oversimplify this estimation and the associated reinforcement learning (RL) reward mechanism: The process of data quality estimation may be too simplified and not easily generalizable to realistic settings involving human feedback. The estimation relies on the assumption that GPT-4 provides better quality outputs than GPT-3.5, and GPT-3.5 surpa
Code & Models
- 🤗 openchat/openchat_v3.1 · model · 828 dl · ♡ 6
- 🤗 openchat/openchat_v3.2 · model · 849 dl · ♡ 42
- 🤗 openchat/openchat_v3.2_super · model · 818 dl · ♡ 34
- 🤗 openchat/openchat_3.5 · model · 2.6k dl · ♡ 1140
- 🤗 TheBloke/openchat_3.5-AWQ · model · 111 dl · ♡ 15
- 🤗 TheBloke/openchat_3.5-GGUF · model · 1.8k dl · ♡ 129
- 🤗 LoneStriker/openchat_3.5-3.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/openchat_3.5-4.0bpw-h6-exl2 · model · 2 dl · ♡ 1
- 🤗 LoneStriker/openchat_3.5-5.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/openchat_3.5-6.0bpw-h6-exl2 · model · 2 dl
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Balanced Selection · ALIGN
