FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

Cheng-Wei Lin; Wan-Hsuan Hsieh; Kai-Xin Guan; Chan-Jan Hsu; Chia-Chen Kuo; Chuan-Lin Lai; Chung-Wei Chung; Ming-Jen Wang; Da-Shan Shiu

arXiv:2411.16387·cs.CL·November 26, 2024

Open Access

TL;DR

FineWeb-zhtw is a large, high-quality dataset for Traditional Chinese language models, created through meticulous filtering to ensure linguistic relevance and comprehensiveness, addressing a gap in Chinese NLP resources.

Contribution

The paper introduces FineWeb-zhtw, a novel curated dataset for Traditional Chinese, with tailored filtering processes to improve data quality for language model training.

Findings

01 Effective filtering methods validated on dataset samples

02 Public availability of code and dataset

03 Enhanced dataset quality for Traditional Chinese NLP

Abstract

The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts to curate such datasets for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon the foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We designed multiple stages of filters that account for the linguistic differences between English and Traditional Chinese, to ensure comprehensiveness and quality. We assessed the filters' effectiveness by querying dataset samples with three main objectives. Our code and datasets are publicly available.

Taxonomy

Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship