The Impact of Editorial Intervention on Detecting Native Language Traces
Ahmet Yavuz Uluslu, Mark Gales, Kate Knill, Gerold Schneider

TL;DR
This study examines how increasing levels of editorial correction, from minimal grammatical error correction to paraphrasing, affect a model's ability to identify an author's native language (L1) from non-native text. It finds that detection relies on deeper linguistic features, not only on surface errors.
Contribution
The paper demonstrates that native-language traces persist beyond surface errors and that different levels of editing erode them to different degrees: minimal edits largely preserve them, while fluency edits and paraphrasing weaken them.
Findings
L1 attribution relies on deep linguistic features beyond surface errors.
Minimal edits preserve L1 traces and high detection accuracy.
Fluency edits and paraphrasing significantly reduce detection performance.
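The comparison behind these findings can be pictured as scoring the same L1 classifier separately on each edited version of the essays. The following is a minimal sketch of that bookkeeping only, not the authors' evaluation code: the function name, the level labels, the language codes, and the toy records are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def accuracy_by_edit_level(records):
    """Return L1 detection accuracy for each editing level.

    `records` is an iterable of (edit_level, true_l1, predicted_l1) tuples.
    This record layout is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, true_l1, predicted_l1 in records:
        total[level] += 1
        correct[level] += int(true_l1 == predicted_l1)
    return {level: correct[level] / total[level] for level in total}


# Toy records invented for illustration; these are NOT results from the paper.
toy_records = [
    ("original", "tr", "tr"),
    ("original", "es", "es"),
    ("minimal_edit", "tr", "tr"),
    ("minimal_edit", "es", "es"),
    ("fluency_edit", "tr", "tr"),
    ("fluency_edit", "es", "tr"),
    ("paraphrase", "tr", "es"),
    ("paraphrase", "es", "tr"),
]

scores = accuracy_by_edit_level(toy_records)
for level, acc in scores.items():
    print(f"{level}: {acc:.2f}")
```

Grouping by editing level keeps the classifier fixed and varies only the degree of intervention, which is what makes any drop in accuracy attributable to the edits.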
Abstract
Native Language Identification (NLI) is the task of determining an author's native language (L1) from their non-native writings. With the advent of human-AI co-authorship, non-native texts are routinely corrected and rewritten by large language models, fundamentally altering the linguistic features NLI models depend on. In this paper, we investigate the robustness of L1 traces across increasing degrees of editorial intervention. By processing 450 essays from the Write & Improve 2024 corpus through varying levels of grammatical error correction (GEC) and paraphrasing, we demonstrate that L1 attribution does not entirely depend on surface-level errors. Instead, the detection models leverage deeper L1 features: unidiomatic lexico-semantic choices, pragmatic transfer, and the author's underlying cultural perspective. We find that minimal edits preserve these structural traces and maintain…
