TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish
Melikşah Türker, A. Ebrar Kızıloğlu, Onur Güngör, Susan Üsküdarlı

TL;DR
TabiBERT is a large-scale, modern Turkish language model that outperforms previous models across multiple tasks, supported by a comprehensive benchmark and trained on a vast, diverse corpus.
Contribution
This work introduces TabiBERT, the first monolingual ModernBERT-based Turkish encoder trained from scratch on a large, multi-domain corpus, and presents TabiBench, a standardized benchmark for Turkish NLP.
Findings
TabiBERT achieves 77.58 on TabiBench, surpassing BERTurk by 1.62 points.
It attains up to 2.65x inference speedup and reduced GPU memory usage.
It outperforms task-specific models such as TurkishBERTweet by 1.47 points on average.
Abstract
Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder that is trained from scratch and incorporates these modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on the ModernBERT architecture and trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B-token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). It supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory…
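The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the variable and function names are hypothetical, the 512-token limit is the commonly cited maximum sequence length of the original BERT, and the "average passes" number assumes tokens are sampled uniformly from the whole corpus, which the abstract does not state.

```python
# Back-of-the-envelope checks on the numbers quoted in the abstract.
# All names here are illustrative; none come from the TabiBERT code base.

CORPUS_TOKENS = 84.88e9      # size of the multi-domain corpus (from the abstract)
TRAINING_TOKENS = 1e12       # tokens seen during pre-training (from the abstract)
TABIBERT_CONTEXT = 8192      # supported context length (from the abstract)
BERT_CONTEXT = 512           # assumption: original BERT's maximum sequence length

# Domain shares as quoted in the abstract.
DOMAIN_SHARES = {
    "web text": 0.73,
    "scientific publications": 0.20,
    "source code": 0.06,
    "mathematical content": 0.003,
}


def average_passes(training_tokens: float, corpus_tokens: float) -> float:
    """Average number of times the corpus is seen, assuming uniform sampling."""
    return training_tokens / corpus_tokens


def domain_token_counts(corpus_tokens: float, shares: dict) -> dict:
    """Approximate token count per domain implied by the quoted shares."""
    return {name: corpus_tokens * share for name, share in shares.items()}


passes = average_passes(TRAINING_TOKENS, CORPUS_TOKENS)
context_ratio = TABIBERT_CONTEXT / BERT_CONTEXT
domain_tokens = domain_token_counts(CORPUS_TOKENS, DOMAIN_SHARES)
share_total = sum(DOMAIN_SHARES.values())

if __name__ == "__main__":
    print(f"average passes over the corpus: {passes:.2f}")
    print(f"context length ratio vs. BERT: {context_ratio:.0f}x")
    for name, tokens in domain_tokens.items():
        print(f"{name}: about {tokens / 1e9:.2f}B tokens")
    print(f"quoted shares sum to {share_total:.1%}")
```

Under these assumptions, one trillion training tokens correspond to roughly a dozen passes over the corpus, and the 8,192-token context is 16 times the 512-token limit, matching the "16x" in the abstract. The quoted shares sum to slightly under 100%, presumably because of rounding in the abstract.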
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
