LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang,, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E., Gonzalez, Ion Stoica, Hao Zhang

TL;DR
LMSYS-Chat-1M is a large, diverse dataset of one million real-world conversations with state-of-the-art LLMs, enabling research in safety, instruction-following, and benchmarking.
Contribution
The paper introduces LMSYS-Chat-1M, a comprehensive dataset of real-world LLM conversations, with detailed curation and multiple use cases for advancing LLM research.
Findings
Content moderation models comparable to GPT-4
Safety benchmark development
Instruction-following models similar to Vicuna
Abstract
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset…
Peer Reviews
Decision·ICLR 2024 spotlight
1. With a size of 1 million, this dataset stands out as unparalleled among its peers. 2. This dataset offers authentic user prompts and highlights potential safety concerns, paving the way for future research. 3. This paper presents studies that highlight the four distinct applications of this novel dataset. 4. The authors have a plan to regularly update the dataset in the future, which is a crucial step given the frequent release of newer and more advanced LLMs from the community.
While the user prompts in this dataset are genuine, the responses are synthetic and do not have quality ratings. Thus, additional processing is necessary before use.
- The data itself is quite valuable. As the authors note, much of the data actually used in training models is proprietary and private - The 4 use cases are all creative and well designed. They demonstrate the potential of the dataset as a strong resource - The analysis is quite interesting, for example the cluster analysis in figure 3. It also supports/is validated by past work, e.g. the observation that many users now are interested in LLMs for help with coding - The size and diversity of the
- The dataset does not seem quite as diverse as the abstract suggests. Although 25 LLMs are used, Vicuna-13B is by far the most frequently used one. Similarly, while there may be a few instances of many languages, the vast majority is still in English - As the authors point out, there are no human preference values which is one weakness compared to related datasets (see table 1) - It is not clear whether the use format (single model vs side-by-side) and IP address have similar balance issues to
- The dataset introduced by this paper is a large-scale dataset containing interactive logs of 210,000 unique IP addresses with 25 large language models, which is both extremely valuable and meaningful for future LLM development, given most datasets that were used to train LLMs are not publicly available. - Part of the data that can jailbreak the safeguards of leading LLMs is repurposed by the authors to be a benchmark for safety and robustness study. - The authors also curate a benchmark that c
- Figure 1 and 2 could be redrawn by leaving vicuna/English out because the distribution is left-skewed and therefore the number for the rest of the models/languages are hard to interpret. - Although in the limitation section there is a paragraph about the data quality, deeper analysis could be done. For example, how many of them is from MMLU and MT-Bench, is there some human annotation done for duplicate data?
Code & Models
- 🤗Goodfire/Llama-3.1-8B-Instruct-SAE-l19model· 31 dl· ♡ 4331 dl♡ 43
- 🤗openGPT-X/Teuken-7B-instruct-research-v0.4model· 1.9k dl· ♡ 891.9k dl♡ 89
- 🤗KnutJaegersberg/Teuken-7B-instruct-research-v0.4-8.0bpw-exl2model· 1 dl1 dl
- 🤗QuantFactory/Teuken-7B-instruct-research-v0.4-GGUFmodel· 281 dl· ♡ 2281 dl♡ 2
- 🤗Goodfire/Llama-3.3-70B-Instruct-SAE-l50model· 29 dl· ♡ 3829 dl♡ 38
- 🤗qresearch/Llama-3.2-1B-Instruct-SAE-l9model· ♡ 16♡ 16
- 🤗qresearch/DeepSeek-R1-Distill-Llama-8B-SAE-l19model· ♡ 8♡ 8
- 🤗qresearch/DeepSeek-R1-Distill-Llama-70B-SAE-l48model· ♡ 14♡ 14
- 🤗rhcl/Teuken-fientuunmodel· 1 dl1 dl
- 🤗PJMixers-Dev/Gemma-3-Earthen-v0.1-4B-QLoRAmodel· 1 dl1 dl
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Byte Pair Encoding · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam
