COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Yuelin Bai; Xinrun Du; Yiming Liang; Yonggang Jin; Junting Zhou; Ziqiang Liu; Feiteng Fang; Mingshan Chang; Tianyu Zheng; Xincheng Zhang; Nuo Ma; Zekun Wang; Ruibin Yuan; Haihong Wu; Hongquan Lin; Wenhao Huang; Jiajun Zhang; Chenghua Lin; Jie Fu; Min Yang; Shiwen Ni; Ge Zhang

arXiv:2403.18058 · cs.CL · November 5, 2024 · 5 cites

TL;DR

This paper introduces COIG-CQIA, a new Chinese instruction tuning dataset created from real-world resources, which improves the performance of language models on Chinese tasks and offers insights into dataset design.

Contribution

We present COIG-CQIA, a human-verified Chinese instruction dataset tailored for LLMs, addressing linguistic and interaction pattern gaps in existing datasets.

Findings

01. Models trained on COIG-CQIA outperform baselines on Chinese benchmarks.

02. The dataset enables more effective Chinese instruction tuning.

03. The experiments offer insights into data-mixing strategies for Chinese LLMs.

Abstract

Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of large language models (LLMs). However, there remains a noticeable gap in instruction tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and undergoing rigorous human verification. We conduct extensive experiments on COIG-CQIA, and compare them with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese…

Taxonomy

Topics: Educational Technology and Assessment

Methods: ALIGN