HUKUKBERT: Domain-Specific Language Model for Turkish Law
Mehmet Utku Öztürk, Tansu Türkoğlu, Buse Buz-Yalug

TL;DR
HukukBERT is a comprehensive Turkish legal language model trained on an 18 GB corpus, achieving state-of-the-art results in legal term prediction and court decision segmentation, supporting future legal NLP research.
Contribution
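The paper introduces HukukBERT, a large-scale Turkish legal language model trained with a hybrid Domain-Adaptive Pre-Training (DAPT) approach that combines Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. It outperforms general-purpose and existing domain-specific Turkish models on legal benchmarks.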
Findings
HukukBERT achieves 84.40% Top-1 accuracy on the Legal Cloze Test.
It attains a 92.8% document pass rate in court decision segmentation.
The model outperforms existing Turkish legal NLP models.
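For readers unfamiliar with the metric, Top-1 accuracy is conventionally the fraction of test items whose highest-ranked prediction equals the gold answer. The sketch below illustrates that conventional definition only; the function name `top1_accuracy` and the sample data are hypothetical, and the paper's exact evaluation protocol is not described in this summary.

```python
def top1_accuracy(ranked_predictions, gold):
    """Fraction of items whose top-ranked prediction matches the gold answer.

    ranked_predictions: list of candidate lists, best candidate first.
    gold: list of correct answers, aligned with ranked_predictions.
    """
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold answers must be aligned")
    if not gold:
        raise ValueError("at least one item is required")
    hits = sum(
        1
        for candidates, answer in zip(ranked_predictions, gold)
        if candidates and candidates[0] == answer
    )
    return hits / len(gold)


# Hypothetical cloze items: each inner list is a model's ranked guesses
# for one masked legal term (illustrative data, not from the paper).
predictions = [["davacı", "davalı"], ["temyiz", "istinaf"], ["karar", "hüküm"]]
answers = ["davacı", "istinaf", "karar"]
score = top1_accuracy(predictions, answers)  # 2 of 3 items are top-1 hits
```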
Abstract
Recent advances in natural language processing (NLP) have increasingly enabled LegalTech applications, yet existing studies specific to Turkish law have still been limited due to the scarcity of domain-specific data and models. Although extensive models like LEGAL-BERT have been developed for English legal texts, the Turkish legal domain lacks a domain-specific high-volume counterpart. In this paper, we introduce HukukBERT, the most comprehensive legal language model for Turkish, trained on an 18 GB cleaned legal corpus using a hybrid Domain-Adaptive Pre-Training (DAPT) methodology integrating Whole-Word Masking, Token Span Masking, Word Span Masking, and targeted Keyword Masking. We systematically compared our 48K WordPiece tokenizer and DAPT approach against general-purpose and existing domain-specific Turkish models. Evaluated on a novel Legal Cloze Test benchmark -- a masked legal…
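Of the masking strategies named in the abstract, Whole-Word Masking is the easiest to illustrate. The following is a minimal sketch of the general technique, not the authors' implementation: it assumes the standard WordPiece convention that continuation pieces start with `##`, and it simplifies by always substituting `[MASK]` (omitting BERT's usual random/keep replacements and any special-token handling). The function name, the masking probability, and the example token split are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words in a list of WordPiece tokens.

    A word is a head piece followed by its "##" continuation pieces.
    Either every piece of a word is masked, or none is.
    Returns (masked_tokens, labels); labels hold the original piece at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)

    # Group token indices into words using the "##" continuation marker.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in words:
        # One random draw per word, so all its pieces share the decision.
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


# Hypothetical WordPiece split of a Turkish phrase (for illustration only,
# not produced by the paper's 48K tokenizer).
example = ["mahkeme", "##nin", "karar", "##ı", "temyiz", "edil", "##di"]
masked_tokens, target_labels = whole_word_mask(example, mask_prob=0.5, seed=0)
```

Drawing once per word rather than once per token is what distinguishes this from plain token-level masking: the model can never recover a masked suffix from its visible stem.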
