Native Language Identification with Large Language Models

Wei Zhang; Alexandre Salle

arXiv:2312.07819·cs.CL·December 14, 2023·1 cites

TL;DR

This paper demonstrates that large language models such as GPT-4 can identify a writer's native language from their second-language writing, achieving high accuracy and providing interpretable justifications.

Contribution

The paper is the first to evaluate LLMs on Native Language Identification (NLI). It shows that they perform well and, unlike fully supervised classifiers, are not restricted to a fixed set of known classes, which matters in real-world use.

Findings

01. GPT-4 achieves 91.7% accuracy on the TOEFL11 test set in a zero-shot setting.

02. LLMs can perform NLI without being limited to a predefined set of classes.

03. LLMs can justify their predictions with linguistic reasoning.

Abstract

We present the first experiments on Native Language Identification (NLI) using LLMs such as GPT-4. NLI is the task of predicting a writer's first language by analyzing their writings in a second language, and is used in second language acquisition and forensic linguistics. Our results show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting. We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes, which has practical implications for real-world applications. Finally, we also show that LLMs can provide justification for their choices, providing reasoning based on spelling errors, syntactic patterns, and usage of directly translated linguistic patterns.
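To make the zero-shot setup concrete, here is a minimal sketch of how a closed-set or open-set NLI prompt might be built and scored. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names (`build_prompt`, `parse_label`, `accuracy`), and the `llm` callable are all hypothetical, and no real model API is called. The only detail taken from the benchmark itself is the list of the 11 native languages in TOEFL11.

```python
from typing import Callable, Optional, Sequence

# The 11 native languages (L1s) covered by the TOEFL11 corpus.
TOEFL11_L1S = (
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
)


def build_prompt(essay: str, classes: Optional[Sequence[str]] = TOEFL11_L1S) -> str:
    """Build a zero-shot prompt (assumed wording, not the paper's).

    Pass classes=None for the open-set variant with no fixed label list.
    """
    if classes:
        constraint = "Choose exactly one of: " + ", ".join(classes) + "."
    else:
        constraint = "Name the single most likely language."
    return (
        "The following English text was written by a non-native speaker.\n"
        "Identify the writer's native language. " + constraint + "\n"
        "Answer with the language name only.\n\n"
        "Text:\n" + essay
    )


def parse_label(reply: str, classes: Sequence[str] = TOEFL11_L1S) -> Optional[str]:
    """Return the known class mentioned earliest in the reply, or None."""
    lowered = reply.lower()
    hits = [(lowered.find(c.lower()), c) for c in classes if c.lower() in lowered]
    return min(hits)[1] if hits else None


def accuracy(
    essays: Sequence[str],
    gold: Sequence[str],
    llm: Callable[[str], str],  # hypothetical: maps a prompt to the model's reply
) -> float:
    """Fraction of essays whose parsed prediction matches the gold L1."""
    correct = sum(
        parse_label(llm(build_prompt(essay))) == label
        for essay, label in zip(essays, gold)
    )
    return correct / len(gold)
```

In this sketch, `llm` would be whatever function wraps the chosen model. Replies that mention no known class parse to `None` and count as errors, which is one conservative way to score a closed-set benchmark; for open-set use, the raw reply would be kept instead of being mapped onto a fixed list.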

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection

Methods: Sparse Evolutionary Training · Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · Residual Connection · Dropout