A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus
Ebubekir Tosun, Mehmet Emin Buldur, Özay Ezerceli, Mahmoud ElHussieni

TL;DR
This paper introduces a scalable hybrid methodology for building large-scale semantic datasets in low-resource languages. Demonstrated on Turkish, it expands existing resources roughly tenfold for about $65 and is validated on two downstream NLP tasks.
Contribution
A novel hybrid protocol combining clustering, automated classification, and dictionary integration for large-scale semantic dataset generation in low-resource languages.
Findings
Dataset of 843,000 semantic pairs created for Turkish
Embedding model achieves 90% top-1 retrieval accuracy
Classification model attains 90% F1-macro score
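For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how top-1 retrieval accuracy and macro-averaged F1 are conventionally computed. The paper's exact evaluation protocol is not given here, so the single-gold-answer retrieval setup, the function names, and the toy data are all assumptions made for illustration.

```python
def top1_accuracy(ranked, gold):
    """Fraction of queries whose first-ranked candidate equals the gold answer.

    Assumes one gold answer per query (an illustrative simplification).
    """
    hits = sum(1 for query, cands in ranked.items() if cands and cands[0] == gold[query])
    return hits / len(ranked)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Hypothetical toy data, not taken from the corpus.
ranked = {"büyük": ["iri", "küçük"], "kedi": ["köpek", "pisi"]}
gold = {"büyük": "iri", "kedi": "pisi"}
y_true = ["syn", "syn", "ant", "ant", "cohyp", "cohyp"]
y_pred = ["syn", "ant", "ant", "ant", "cohyp", "cohyp"]

retrieval_score = top1_accuracy(ranked, gold)
f1_score = macro_f1(y_true, y_pred)
```

Macro averaging weights the three relationship types equally regardless of how many pairs each contains, so a rare class counts as much as a frequent one.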
Abstract
We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms) representing a 10x scale increase over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% F1-macro. Our scalable protocol addresses critical data scarcity in Turkish NLP and…
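Phase (1) of the approach can be sketched roughly as follows. This is a pure-Python stand-in, not the authors' implementation: the three-dimensional vectors are hypothetical placeholders for FastText embeddings, and the average linkage, cosine distance, and 0.1 distance threshold are assumptions chosen only to keep the example small and runnable.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering over cosine distance.

    Repeatedly merges the two closest clusters and stops once the closest
    remaining pair is farther apart than `threshold`.
    """
    clusters = [[word] for word in vectors]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def candidate_pairs(clusters):
    """All unordered word pairs that share a cluster."""
    return [tuple(sorted(pair)) for c in clusters for pair in combinations(c, 2)]


# Hypothetical toy vectors standing in for FastText embeddings.
TOY_VECTORS = {
    "büyük": [0.9, 0.1, 0.0],     # "big"
    "iri": [0.85, 0.15, 0.05],    # "large"
    "küçük": [0.8, 0.2, 0.1],     # "small"
    "kedi": [0.0, 0.9, 0.4],      # "cat"
    "köpek": [0.1, 0.85, 0.45],   # "dog"
}

clusters = agglomerative_clusters(TOY_VECTORS, threshold=0.1)
pairs = candidate_pairs(clusters)
```

In this toy setup "büyük" and "küçük" (antonyms) land in the same cluster as "büyük" and "iri" (synonyms), which illustrates why clustering alone only proposes candidate pairs and a separate classification phase is needed to assign the relationship type.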
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Language and cultural evolution
