On the Impact of Cross-Domain Data on German Language Models

Amin Dada; Aokun Chen; Cheng Peng; Kaleb E Smith; Ahmad Idrissi-Yaghir; Constantin Marc Seibold; Jianning Li; Lars Heiliger; Xi Yang; Christoph M. Friedrich; Daniel Truhn; Jan Egger; Jiang Bian; Jens Kleesiek; Yonghui Wu

arXiv:2310.07321·cs.CL·October 16, 2023·1 cites

TL;DR

This paper investigates the impact of cross-domain versus high-quality data on German language models, showing that diverse datasets lead to better performance across multiple tasks.

Contribution

It introduces a new cross-domain German dataset and demonstrates that training on diverse data yields superior results compared to high-quality data alone.

Findings

01 Models trained on cross-domain data outperform those trained on quality data.

02 Performance improves by up to 4.45% over the previous state of the art.

03 Cross-domain training enhances downstream task performance.

Abstract

Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification