Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ning Ding; Yulin Chen; Bokai Xu; Yujia Qin; Zhi Zheng; Shengding Hu; Zhiyuan Liu; Maosong Sun; Bowen Zhou

arXiv:2305.14233·cs.CL·May 24, 2023·6 cites

Open Access · 1 Repo · 10 Models · 5 Datasets

TL;DR

This paper introduces UltraChat, a large-scale, diverse dataset of instructional conversations, and uses it to fine-tune a LLaMA-based model, UltraLLaMA, which surpasses existing open-source chat models in performance.

Contribution

The paper presents UltraChat, a comprehensive dataset of 1.5 million multi-turn dialogues, and demonstrates its effectiveness by fine-tuning UltraLLaMA, achieving state-of-the-art results among open-source models.

Findings

01 UltraChat contains 1.5 million high-quality dialogues.

02 UltraLLaMA outperforms other open-source chat models like Vicuna.

03 UltraChat's diversity and scale improve model performance.

Abstract

Fine-tuning on instruction data has been widely validated as an effective practice for implementing chat language models like ChatGPT. Scaling the diversity and quality of such data, although straightforward, stands a great chance of leading to improved performance. This paper aims to improve the upper bound of open-source models further. We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat, which does not involve human queries. Our objective is to capture the breadth of interactions that a human might have with an AI assistant, and we employ a comprehensive framework to generate multi-turn conversations iteratively. UltraChat contains 1.5 million high-quality multi-turn dialogues and covers a wide range of topics and instructions. Our statistical analysis of UltraChat reveals its superiority in various key metrics,…

Code & Models

Repositories

thunlp/ultrachat
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques