Native Language Identification with Large Language Models
Wei Zhang, Alexandre Salle

TL;DR
This paper demonstrates that large language models such as GPT-4 can identify a writer's native language from their second-language writing, reaching 91.7% zero-shot accuracy on TOEFL11 and providing interpretable justifications for their predictions.
Contribution
The paper is the first to evaluate LLMs on Native Language Identification (NLI), showing that they perform well and can generalize beyond a fixed set of classes in real-world scenarios.
Findings
GPT-4 achieves 91.7% accuracy on TOEFL11 in zero-shot mode.
LLMs can perform NLI without being limited to a predefined set of classes.
LLMs can justify their predictions with linguistic reasoning.
Abstract
We present the first experiments on Native Language Identification (NLI) using LLMs such as GPT-4. NLI is the task of predicting a writer's first language by analyzing their writings in a second language, and is used in second language acquisition and forensic linguistics. Our results show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting. We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes, which has practical implications for real-world applications. Finally, we also show that LLMs can provide justification for their choices, providing reasoning based on spelling errors, syntactic patterns, and usage of directly translated linguistic patterns.
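The zero-shot setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' prompt or code: the prompt wording, the answer format, and the helper names `build_nli_prompt` and `parse_prediction` are all assumptions made for this example. The only fixed fact is the list of the 11 native languages covered by TOEFL11. Passing `classes=None` gives the open-set variant, where the model is not restricted to known classes. The call to the model itself is left out, since it depends on the API client in use.

```python
import re
from typing import Optional, Sequence

# The 11 native languages (L1s) covered by the TOEFL11 corpus.
TOEFL11_L1S = (
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
)


def build_nli_prompt(essay: str,
                     classes: Optional[Sequence[str]] = TOEFL11_L1S) -> str:
    """Build a zero-shot NLI prompt (hypothetical wording).

    With `classes` set, the model must pick from a closed set.
    With `classes=None`, any language is allowed (open-set variant).
    """
    if classes:
        task = "Choose exactly one of: " + ", ".join(classes) + "."
    else:
        task = ("Name the single most likely native language; "
                "any language is allowed.")
    return (
        "Read the English essay below and identify the author's "
        "native language.\n"
        f"{task}\n"
        "Answer in the form 'Native language: <language>' and then "
        "briefly justify the answer.\n\n"
        f"Essay:\n{essay}"
    )


def parse_prediction(reply: str,
                     classes: Optional[Sequence[str]] = None) -> Optional[str]:
    """Extract the predicted language from a model reply.

    Returns None if no answer line is found, or if a closed set is
    given and the predicted label is not in it.
    """
    match = re.search(r"Native language:\s*([A-Za-z][A-Za-z \-]*)", reply)
    if not match:
        return None
    label = match.group(1).strip()
    if classes is None:
        return label  # open-set: accept any language name
    for known in classes:
        if known.lower() == label.lower():
            return known
    return None
```

In use, the string returned by `build_nli_prompt` would be sent to the model as a single message, and the reply passed to `parse_prediction`; the text after the answer line is the model's justification.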
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
Methods: Sparse Evolutionary Training · Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · Residual Connection · Dropout
