GeistBERT: Breathing Life into German NLP

Raphael Scheible-Schmitt; Johann Frei

arXiv:2506.11903·cs.CL·July 14, 2025

Open Access

TL;DR

GeistBERT is a German-specific transformer model pre-trained on a 1.3 TB German corpus. It achieves state-of-the-art results on GermEval 2018 classification and outperforms larger models on multiple benchmarks.

Contribution

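The paper introduces GeistBERT, a German language model initialized from GottBERT weights and pre-trained with fairseq on 1.3 TB of German text, using the RoBERTa base configuration with Whole Word Masking (WWM), dynamic masking, and a fixed sequence length of 512 tokens.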

Findings

01 Achieved state-of-the-art results on GermEval 2018 classification

02 Outperformed larger models on multiple benchmarks

03 Demonstrated strong results across diverse NLP tasks

Abstract

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI…
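The abstract names Whole Word Masking with dynamic masking but does not show how the two combine. The following is a minimal sketch under stated assumptions, not the authors' fairseq implementation: the input is assumed to be already split into words, each given as a list of subword pieces; the helper names `whole_word_mask` and `dynamic_masks`, the `<mask>` string, and the 15% masking rate are illustrative choices. For brevity it always substitutes the mask token and omits the random-replacement and keep-unchanged cases used in BERT-style masking.

```python
import random

MASK = "<mask>"  # illustrative mask symbol, assumed for this sketch


def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask whole words: every subword piece of a selected word is masked.

    words: list of words, each a list of subword pieces (assumed input format).
    Returns (masked_pieces, labels) as flat lists of equal length; labels hold
    the original piece at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for pieces in words:
        if rng.random() < mask_prob:
            # The decision is made once per word, so pieces are never split.
            masked.extend([MASK] * len(pieces))
            labels.extend(pieces)
        else:
            masked.extend(pieces)
            labels.extend([None] * len(pieces))
    return masked, labels


def dynamic_masks(words, epochs, mask_prob=0.15, seed=0):
    """Dynamic masking: draw a fresh mask each time the sequence is seen."""
    rng = random.Random(seed)
    return [whole_word_mask(words, mask_prob, rng) for _ in range(epochs)]


if __name__ == "__main__":
    # Hypothetical segmentation of a German compound into subword pieces.
    sentence = [["Die"], ["Donau", "dampf", "schiff", "fahrt"], ["beginnt"], ["."]]
    for masked, _ in dynamic_masks(sentence, epochs=3, mask_prob=0.5):
        print(" ".join(masked))
```

Because the masking decision is taken per word, a long German compound is either fully visible or fully hidden, which is the property that distinguishes WWM from masking individual subword pieces. Static masking would instead compute one mask per sequence during preprocessing and reuse it in every epoch.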

Taxonomy

Topics: Linguistic research and analysis · Linguistic Education and Pedagogy