SindBERT, the Sailor: Charting the Seas of Turkish NLP

Raphael Scheible-Schmitt; Stefan Schweter

arXiv:2510.21364·cs.CL·October 27, 2025

TL;DR

SindBERT is the first large-scale RoBERTa-based encoder for Turkish. It is trained from scratch on 312 GB of Turkish text and evaluated on four NLP tasks; the results point to limited gains from scaling and to the importance of corpus quality.

Contribution

Introduces SindBERT, the first large-scale Turkish RoBERTa-based encoder, and provides an empirical study on scaling effects and corpus quality in morphologically rich languages.

Findings

01 SindBERT performs competitively with existing Turkish and multilingual models.

02 Scaling benefits are limited, which may indicate benchmark saturation.

03 Corpus quality can outweigh data volume in model performance.

Abstract

Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend,…
