Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training
H. Toprak Kesgin, M. Kaan Yuce, Eren Dogan, M. Egemen Uzun, Atahan Uz, Elif Ince, Yusuf Erdem, Osama Shbib, Ahmed Zeer, M. Fatih Amasyali

TL;DR
This paper introduces new corpus selection and training methodologies to enhance Turkish language models, leveraging adapted datasets and merging techniques to significantly improve accuracy and comprehension in under-resourced language settings.
Contribution
It presents novel corpus adaptation and merging strategies specifically designed for Turkish, demonstrating substantial performance improvements over existing models.
Findings
Enhanced model accuracy in few-shot and zero-shot scenarios
Improved task-specific performance and language comprehension
Effective merging of adapted models boosts overall performance
Abstract
In this study, we develop and assess new corpus selection and training methodologies to improve the effectiveness of Turkish language models. Specifically, we adapted Large Language Model generated datasets and translated English datasets into Turkish, integrating these resources into the training process. This approach led to substantial enhancements in model accuracy for both few-shot and zero-shot learning scenarios. Furthermore, the merging of these adapted models was found to markedly improve their performance. Human evaluative metrics, including task-specific performance assessments, further demonstrated that these adapted models possess a greater aptitude for comprehending the Turkish language and addressing logic-based queries. This research underscores the importance of refining corpus selection strategies to optimize the performance of multilingual models, particularly for…
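The abstract says that merging the adapted models improved performance, but it does not say which merging method was used. As a purely illustrative sketch, assuming the simplest common approach (linear interpolation of weights between two checkpoints with the same architecture), merging might look like the following. The function name `merge_linear`, the `alpha` parameter, and the toy checkpoints are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# A toy stand-in for a model checkpoint: parameter name -> flat list of weights.
# Real checkpoints hold tensors; plain lists keep this sketch dependency-free.
StateDict = Dict[str, List[float]]


def merge_linear(a: StateDict, b: StateDict, alpha: float = 0.5) -> StateDict:
    """Return alpha * a + (1 - alpha) * b, computed parameter by parameter.

    Assumption: both checkpoints share the same parameter names and shapes,
    which holds when they were fine-tuned from the same base model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if a.keys() != b.keys():
        raise ValueError("checkpoints must have identical parameter names")

    merged: StateDict = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])
        ]
    return merged


# Hypothetical checkpoints, e.g. one adapted on LLM-generated data and one
# adapted on translated data (values are made up for illustration).
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_linear(model_a, model_b, alpha=0.5)
```

With `alpha=0.5` each merged weight is the plain average of the two inputs; other values of `alpha` bias the result toward one checkpoint. Whether the authors used this scheme or a different one cannot be determined from the abstract alone.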
