GeistBERT: Breathing Life into German NLP
Raphael Scheible-Schmitt, Johann Frei

TL;DR
GeistBERT is a German-specific transformer model trained on a large corpus, achieving state-of-the-art results across multiple NLP tasks and outperforming larger models.
Contribution
It introduces GeistBERT, a new German language model pre-trained on a 1.3 TB German corpus using the RoBERTa base configuration with Whole Word Masking (WWM) and initialized from GottBERT weights.
Findings
Achieved state-of-the-art in GermEval 2018 classification
Outperformed larger models in multiple benchmarks
Demonstrated strong results across diverse NLP tasks
Abstract
Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI…
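The abstract names two pre-training choices, Whole Word Masking and dynamic masking, which the following sketch illustrates on a toy example. It is not the authors' fairseq pipeline. The function names, the `word_ids` representation, the 15% masking rate, and the "always replace with `<mask>`" rule are assumptions made for illustration; only the 512-token sequence length comes from the abstract.

```python
import random

MASK = "<mask>"
MAX_LEN = 512  # sequence length stated in the abstract


def group_by_word(word_ids):
    """Group token positions by the word they belong to."""
    groups = {}
    for pos, wid in enumerate(word_ids):
        groups.setdefault(wid, []).append(pos)
    return list(groups.values())


def whole_word_mask(tokens, word_ids, mask_prob=0.15, rng=None):
    """Mask complete words: every subword piece of a chosen word is masked.

    mask_prob=0.15 is an assumed rate, not a value taken from the paper.
    Real pipelines usually also keep or randomize some selected tokens;
    this sketch always substitutes MASK for simplicity.
    """
    rng = rng or random.Random()
    # Naive truncation; a real pipeline would avoid cutting a word in half.
    tokens = list(tokens[:MAX_LEN])
    word_ids = list(word_ids[:MAX_LEN])
    words = group_by_word(word_ids)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position not predicted
    if not words:
        return masked, labels
    n_to_mask = max(1, round(len(words) * mask_prob))
    for positions in rng.sample(words, n_to_mask):
        for pos in positions:
            labels[pos] = tokens[pos]
            masked[pos] = MASK
    return masked, labels


# Toy input: "Bundes" + "regierung" are two pieces of one word (word id 1).
tokens = ["Die", "Bundes", "regierung", "tagt", "heute"]
word_ids = [0, 1, 1, 2, 3]

# Dynamic masking: draw a fresh mask each time the sequence is seen,
# instead of fixing one mask during preprocessing.
for epoch in range(3):
    masked, _ = whole_word_mask(tokens, word_ids, rng=random.Random(epoch))
    print(epoch, masked)
```

Because the mask is sampled per call, the same sentence can expose different words to the prediction objective across epochs, and a multi-piece word such as "Bundes" + "regierung" is always hidden or shown as a unit.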
Taxonomy
Topics: Linguistic research and analysis · Linguistic Education and Pedagogy
