MiChao-HuaFen 1.0: A Specialized Pre-trained Corpus Dataset for Domain-specific Large Models
Yidong Liu, FuKai Shang, Fang Wang, Rui Xu, Jun Wang, Wei Li, Yao Li,, Conghui He

TL;DR
This paper introduces MiChao-HuaFen 1.0, a high-quality, domain-specific pre-trained corpus dataset for Chinese news and government sectors, aimed at enhancing large model performance in specialized fields.
Contribution
The paper presents a new, carefully curated dataset tailored for pre-training large models in Chinese domain-specific applications, addressing limitations of existing models.
Findings
Dataset supports improved domain-specific model training
Ensures high data quality and reliable updates
Facilitates deep learning research in Chinese vertical domains
Abstract
With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have demonstrated exceptional capabilities across various domains. Nevertheless, there remains a demand for high-quality, domain-specific outputs in areas like healthcare, law, and finance. This paper first evaluates the existing large models for specialized domains and discusses their limitations. To cater to the specific needs of certain domains, we introduce the ``MiChao-HuaFen 1.0'' pre-trained corpus dataset, tailored for the news and governmental sectors. The dataset, sourced from publicly available internet data from 2022, underwent multiple rounds of cleansing and processing to ensure high quality and reliable origins, with provisions for consistent and stable updates. This dataset not only supports the pre-training of large models for Chinese vertical domains but also aids in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Machine Learning in Healthcare
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections
