MiChao-HuaFen 1.0: A Specialized Pre-trained Corpus Dataset for Domain-specific Large Models

Yidong Liu; FuKai Shang; Fang Wang; Rui Xu; Jun Wang; Wei Li; Yao Li; Conghui He

arXiv:2309.13079·cs.CL·September 27, 2023

Open Access

TL;DR

This paper introduces MiChao-HuaFen 1.0, a high-quality, domain-specific pre-trained corpus dataset for Chinese news and government sectors, aimed at enhancing large model performance in specialized fields.

Contribution

The paper presents a new, carefully curated dataset tailored for pre-training large models in Chinese domain-specific applications, addressing limitations of existing models.

Findings

01. Dataset supports improved domain-specific model training
02. Ensures high data quality and reliable updates
03. Facilitates deep learning research in Chinese vertical domains

Abstract

With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have demonstrated exceptional capabilities across various domains. Nevertheless, there remains a demand for high-quality, domain-specific outputs in areas like healthcare, law, and finance. This paper first evaluates the existing large models for specialized domains and discusses their limitations. To cater to the specific needs of certain domains, we introduce the "MiChao-HuaFen 1.0" pre-trained corpus dataset, tailored for the news and governmental sectors. The dataset, sourced from publicly available internet data from 2022, underwent multiple rounds of cleansing and processing to ensure high quality and reliable origins, with provisions for consistent and stable updates. This dataset not only supports the pre-training of large models for Chinese vertical domains but also aids in…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections