OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Guan Wang; Sijie Cheng; Xianyuan Zhan; Xiangang Li; Sen Song; Yang Liu

arXiv:2309.11235 · cs.CL · March 19, 2024 · 26 citations

TL;DR

OpenChat is a framework for fine-tuning open-source language models on mixed-quality data. It uses class-conditioned reinforcement learning fine-tuning (C-RLFT), and its 13b model outperforms other 13b open-source models on benchmarks without costly preference labeling.

Contribution

The paper proposes C-RLFT, a class-conditioned reinforcement learning approach that effectively utilizes mixed-quality data for open-source language model fine-tuning, avoiding expensive human preference labels.

Findings

01. OpenChat-13b outperforms other 13b open-source models on benchmarks.

02. The model generalizes well, surpassing the base model on AGIEval.

03. C-RLFT is lightweight and avoids costly human preference labeling.

Abstract

Nowadays, open-source large language models like LLaMA have emerged. Recent developments have incorporated supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to align these models with human goals. However, SFT methods treat all training data with mixed quality equally, while RLFT methods require high-quality pairwise or ranking-based preference data. In this study, we present a novel framework, named OpenChat, to advance open-source language models with mixed-quality data. Specifically, we consider the general SFT training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. Interestingly, the…
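
The class-conditioned, weighted training idea described in the abstract can be illustrated with a minimal sketch. It assumes two data sources, tags each prompt with a source-specific prefix so the policy is conditioned on the data class, and weights each example's negative log-likelihood by a coarse per-source reward. The prefix strings, the weight values (1.0 and 0.1), the normalization by the weight sum, and the names condition_prompt and weighted_sft_loss are all illustrative assumptions, not the authors' implementation.

```python
import math

# Hypothetical coarse-grained rewards per data source (assumed values).
CLASS_WEIGHTS = {"expert": 1.0, "suboptimal": 0.1}

# Hypothetical conditioning prefixes; the real prompt templates may differ.
CLASS_PREFIX = {"expert": "[expert] ", "suboptimal": "[suboptimal] "}


def condition_prompt(prompt, source):
    """Prepend a source-specific tag so the policy can condition on data class."""
    return CLASS_PREFIX[source] + prompt


def weighted_sft_loss(batch):
    """Weighted negative log-likelihood over a batch.

    batch: list of (source, token_log_probs), where token_log_probs are the
    log-probabilities the policy assigns to the response tokens.
    Normalizing by the sum of weights is an assumption made for this sketch.
    """
    total = 0.0
    weight_sum = 0.0
    for source, token_log_probs in batch:
        weight = CLASS_WEIGHTS[source]
        nll = -sum(token_log_probs)  # sequence-level negative log-likelihood
        total += weight * nll
        weight_sum += weight
    return total / weight_sum


# Toy usage with made-up token probabilities.
batch = [
    ("expert", [math.log(0.5), math.log(0.5)]),
    ("suboptimal", [math.log(0.25)]),
]
loss = weighted_sft_loss(batch)
tagged = condition_prompt("Explain C-RLFT.", "expert")
```

Under these assumptions, examples from the sub-optimal source still contribute to the gradient, but with a smaller weight than expert examples, and no pairwise preference labels are needed.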

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The proposed method is easy and efficient to train.
2. The method has many practical applications where data quality cannot be tightly controlled, especially in language model training.
3. The experimental evaluation confirms the effectiveness of the proposed method.

Weaknesses

1. It requires the data to be divisible into two quality sets, good and average, which might limit its applications.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- Originality: The paper introduces a new method, C-RLFT, that leverages coarse-grained rewards and class-conditioned policies to align the language model with human goals. The paper also provides a theoretical analysis and derivation of the optimal policy for C-RLFT. The resulting framework is a conditioned SFT with weighted loss, which is easy to implement and more stable than RLHF.
- Quality: The paper presents evaluation results on three benchmarks to assess instruction following ability of

Weaknesses

- The paper claims the superiority of C-RLFT over RLFT, but the claim is supported only by openchat-13b, which distilled its knowledge from two RLFT models: GPT-4 and GPT-3.5. It is therefore only fair to say that C-RLFT is better than SFT when distilling from GPT models, based on the fact that openchat-13b is better than vicuna-13b-1.5. One cannot say C-RLFT is better than RLFT, because openchat-13b learns from models trained by RLFT.
- It is not clear if the "low quality data" is even us

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The proposed method emphasizes data quality, which is a critical aspect of the model alignment problem. The framework is straightforward and replicable, delivering significant improvements across various benchmark datasets. The evaluation is comprehensive, conducted against several established models and on a wide range of benchmark datasets, with a carefully considered experimental design.

Weaknesses

The paper does not adequately address the challenge of estimating data quality when leveraging mixed-quality data sources. The method presented seems to oversimplify this estimation and the associated reinforcement learning (RL) reward mechanism: The process of data quality estimation may be too simplified and not easily generalizable to realistic settings involving human feedback. The estimation relies on the assumption that GPT-4 provides better quality outputs than GPT-3.5, and GPT-3.5 surpa

Code & Models

Repositories

imoneoi/openchat (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Balanced Selection · ALIGN