Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
Özgür Uğur, Mahmut Göksu, Mahmut Çimen, Musa Yılmaz, Esra Şavirdi, Alp Talha Demir, Rumeysa Güllüce, İclal Çetin, Ömer Can Sağbaş

TL;DR
This paper introduces Mecellem models, specialized Turkish legal language models developed via scratch pre-training and continual domain adaptation, achieving high retrieval performance and domain-specific understanding with efficient training strategies.
Contribution
It presents a novel framework for Turkish legal-domain models, comprising an encoder pre-trained from scratch with checkpoint selection and a continually pre-trained decoder, both optimized for efficiency and performance.
Findings
Encoder models achieve top-3 Turkish retrieval leaderboard rankings.
The approach attains 92.36% production efficiency, against 100.00% for the state-of-the-art embeddinggemma-300m.
Continual pre-training reduces perplexity by 36.2% on Turkish legal texts.
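To make the perplexity finding concrete, here is a minimal sketch of how a relative perplexity reduction is computed from mean per-token loss. The loss values are hypothetical and chosen only for illustration, and the helper names `perplexity` and `relative_reduction` are not from the paper.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical losses for a base model and its continually pre-trained
# counterpart on the same legal text; these are NOT figures from the paper.
base_nll = 2.0
adapted_nll = 1.5

base_ppl = perplexity(base_nll)
adapted_ppl = perplexity(adapted_nll)
reduction = relative_reduction(base_ppl, adapted_ppl)

print(f"perplexity {base_ppl:.2f} -> {adapted_ppl:.2f} ({reduction:.1f}% lower)")
```

Because perplexity is the exponential of the loss, a fixed drop in loss always gives the same percentage drop in perplexity, whatever the starting loss.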
Abstract
This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3:…
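The checkpoint selection idea in the abstract can be sketched as follows: record a downstream retrieval score alongside the pre-training loss at each saved checkpoint, then keep the checkpoint with the best retrieval score rather than the lowest loss. This is a minimal illustration, not the authors' implementation. The `Checkpoint` record, the choice of metric, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    pretrain_loss: float
    retrieval_score: float  # assumed metric, e.g. nDCG@10 on a held-out set


def select_by_retrieval(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Pick the checkpoint with the highest downstream retrieval score."""
    return max(checkpoints, key=lambda c: c.retrieval_score)


def select_by_loss(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Pick the checkpoint with the lowest pre-training loss."""
    return min(checkpoints, key=lambda c: c.pretrain_loss)


# Hypothetical training history (NOT from the paper): the loss keeps falling,
# while the retrieval score peaks earlier and then declines.
history = [
    Checkpoint(step=10_000, pretrain_loss=2.10, retrieval_score=0.41),
    Checkpoint(step=20_000, pretrain_loss=1.85, retrieval_score=0.47),
    Checkpoint(step=30_000, pretrain_loss=1.72, retrieval_score=0.52),
    Checkpoint(step=40_000, pretrain_loss=1.66, retrieval_score=0.50),
    Checkpoint(step=50_000, pretrain_loss=1.63, retrieval_score=0.48),
]

best = select_by_retrieval(history)
lowest = select_by_loss(history)
print(f"best retrieval at step {best.step}, lowest loss at step {lowest.step}")
```

In this toy history the two criteria disagree, which is the situation the abstract describes: the best retrieval checkpoint comes before the loss minimum.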
Code & Models
- 🤗 newmindai/Mursit-Base-TR-Retrieval · model · 833 dl · ♡ 4
- 🤗 newmindai/Mursit-Base · model · 3 dl · ♡ 3
- 🤗 newmindai/Mursit-Large-TR-Retrieval · model · 454 dl · ♡ 6
- 🤗 newmindai/Mursit-Large · model · 91 dl · ♡ 4
- 🤗 newmindai/Mecellem-Qwen3-1.7B-TR · model · 24 dl · ♡ 4
- 🤗 newmindai/Mursit-Embed-Qwen3-1.7B-TR · model · 105 dl · ♡ 3
- 🤗 newmindai/Mecellem-Qwen3-4B-TR · model · 99 dl · ♡ 3
- 🤗 newmindai/Mursit-Embed-Qwen3-4B-TR · model · 6 dl · ♡ 2
- 🤗 newmindai/Muhakim · model · 11 dl · ♡ 4
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Law · Natural Language Processing Techniques
