Data-Juicer: A One-Stop Data Processing System for Large Language Models
Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR
Data-Juicer is a comprehensive system designed to efficiently generate, evaluate, and optimize diverse data recipes for training large language models, significantly enhancing their performance across multiple benchmarks.
Contribution
The paper introduces Data-Juicer, a flexible, extensible system with visualization and auto-evaluation features for creating and assessing data recipes tailored for LLM training.
Findings
Up to 7.45% improvement in LLM benchmark scores
17.5% higher win rate in GPT-4 pairwise evaluations
Enables exploration of diverse data mixtures for LLMs
Abstract
The immense evolution in Large Language Models (LLMs) has underscored the importance of massive, heterogeneous, and high-quality data. A data recipe is a mixture of data from different sources for training LLMs, which plays a vital role in LLMs' performance. Existing open-source tools for LLM data processing are mostly tailored for specific data recipes. To continuously uncover the potential of LLMs, incorporate data from new sources, and improve LLMs' performance, we build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes, explore different possibilities in forming data mixtures, and evaluate their effects on model performance. Different from traditional data-analytics pipelines, Data-Juicer faces some unique challenges. Firstly, the possible data sources for forming data recipes are truly heterogeneous and massive with various qualities.…
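The abstract defines a data recipe as a mixture of data from different sources. A minimal sketch of that idea follows, assuming a recipe can be modelled as per-source sampling weights. The function name `sample_recipe`, the source names, and the weights are all hypothetical and chosen for illustration; this is not Data-Juicer's API or the authors' method.

```python
import random
from collections import Counter


def sample_recipe(sources, weights, n_samples, seed=0):
    """Draw n_samples documents from several sources by mixture weight.

    Hypothetical helper for illustration only (not part of Data-Juicer).
    sources: dict mapping a source name to a list of documents.
    weights: dict mapping the same source names to non-negative weights.
    """
    if set(sources) != set(weights):
        raise ValueError("sources and weights must name the same datasets")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    names = sorted(sources)
    probs = [weights[name] / total for name in names]

    mixed = []
    for _ in range(n_samples):
        # Pick a source according to the recipe, then a document from it.
        name = rng.choices(names, weights=probs, k=1)[0]
        mixed.append((name, rng.choice(sources[name])))
    return mixed


# Toy sources and weights, made up for this example.
sources = {
    "web": ["web doc 1", "web doc 2", "web doc 3"],
    "code": ["def f(): pass", "print('hi')"],
    "books": ["chapter one", "chapter two"],
}
weights = {"web": 0.6, "code": 0.3, "books": 0.1}

mixture = sample_recipe(sources, weights, n_samples=1000)
counts = Counter(name for name, _ in mixture)
```

Changing the `weights` dictionary gives a different recipe over the same sources, which is the kind of variation the abstract says the system is meant to generate and evaluate.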
Code & Models
- 🤗 datajuicer/LLaMA-1B-dj-refine-50B (model) · 4 dl
- 🤗 datajuicer/LLaMA-1B-dj-refine-100B (model) · 6 dl
- 🤗 datajuicer/LLaMA-1B-dj-refine-150B (model) · 389k dl · ♡ 3
- 🤗 datajuicer/LLaMA-1B-dj-refine-150B-instruct-4.7B (model) · 7 dl · ♡ 1
- 🤗 datajuicer/LLaMA-7B-EN-Chat-40k (model) · 8 dl · ♡ 1
- 🤗 datajuicer/LLaMA2-7B-ZH-Chat-52k (model) · 8 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings
