Turkish Native Language Identification V2
Ahmet Yavuz Uluslu, Gerold Schneider

TL;DR
This study applies native language identification (NLI) to Turkish, using syntactic features to identify writers' L1 from their L2 Turkish texts and extending NLI research beyond English.
Contribution
It introduces the first NLI approach for Turkish, analyzing L1 transfer effects with syntactic features and providing publicly available data and code.
Findings
Syntactic features effectively distinguish the native language (L1) of L2 Turkish writers.
Part-of-Speech n-gram models outperform hybrid models.
L1 transfer effects are revealed through feature analysis.
Abstract
This paper presents the first application of Native Language Identification (NLI) for the Turkish language. NLI is the task of automatically identifying an individual's native language (L1) based on their writing or speech in a non-native language (L2). While most NLI research has focused on L2 English, our study extends this scope to L2 Turkish by analyzing a corpus of texts written by native speakers of Albanian, Arabic and Persian. We leverage a cleaned version of the Turkish Learner Corpus and demonstrate the effectiveness of syntactic features, comparing a structural Part-of-Speech n-gram model to a hybrid model that retains function words. Our models achieve promising results, and we analyze the most predictive features to reveal L1-specific transfer effects. We make our data and code publicly available for further study.
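To make the contrast between the two feature representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes the input is already POS-tagged with Universal Dependencies style tags, and it assumes that "function words" can be approximated by a set of closed-class tags (`FUNCTION_POS`). The helper names and the hand-tagged toy sentence are hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter

# Assumption: closed-class UD tags stand in for "function words".
# The paper may define function words differently.
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}


def structural_sequence(tagged):
    """Keep only the POS tag of every token (purely structural view)."""
    return [pos for _, pos in tagged]


def hybrid_sequence(tagged):
    """Keep the lowercased surface form of function words and
    replace every other token with its POS tag."""
    return [tok.lower() if pos in FUNCTION_POS else pos for tok, pos in tagged]


def ngram_counts(seq, n):
    """Count contiguous n-grams over a sequence of symbols."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


# Hypothetical hand-tagged sentence: "Ben okula gittim ve eve döndüm"
tagged = [
    ("Ben", "PRON"), ("okula", "NOUN"), ("gittim", "VERB"),
    ("ve", "CCONJ"), ("eve", "NOUN"), ("döndüm", "VERB"),
]

structural = structural_sequence(tagged)
hybrid = hybrid_sequence(tagged)

structural_bigrams = ngram_counts(structural, 2)
hybrid_bigrams = ngram_counts(hybrid, 2)
```

In this sketch the structural view reduces the sentence to `PRON NOUN VERB CCONJ NOUN VERB`, while the hybrid view keeps `ben` and `ve` as surface forms. The resulting n-gram counts could then be fed to any standard classifier. Note that Python's `str.lower()` is not locale-aware, so Turkish dotted and dotless I would need separate handling in real data.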
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Translation Studies and Practices
