Native Language Identification with Big Bird Embeddings
Sergey Kramp, Giovanni Cassani, Chris Emmery

TL;DR
This paper shows that classifiers trained on Big Bird transformer embeddings outperform linguistic feature engineering models for native language identification by a large margin on the Reddit-L2 dataset, and that accuracy improves with longer input texts, making the approach a practical and computationally efficient alternative.
Contribution
The paper introduces Big Bird embeddings as input features for NLI classifiers, shows that they outperform prior linguistic feature-based models, and examines how performance depends on input length.
Findings
Big Bird embeddings outperform linguistic feature models on the Reddit-L2 dataset.
Input length positively correlates with classification accuracy.
The method maintains consistent out-of-sample performance.
Abstract
Native Language Identification (NLI) intends to classify an author's native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and transformer-based NLI models have thus far failed to offer effective, practical alternatives. The current work investigates if input size is a limiting factor, and shows that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
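The pipeline described in the abstract (embed each text with Big Bird, then train a classifier on the embeddings) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint name google/bigbird-roberta-base, the max_length of 2048 tokens, and the use of mean pooling are illustrative choices, and the nearest-centroid classifier is a deliberately simple stand-in for whatever classifier the paper trains. The helper names embed_texts, fit_centroids, and predict are hypothetical.

```python
from collections import defaultdict
from math import dist


def embed_texts(texts, model_name="google/bigbird-roberta-base", max_length=2048):
    """Return one mean-pooled embedding (list of floats) per input text.

    Assumptions: the checkpoint name, max_length, and mean pooling are
    illustrative choices, not settings confirmed by the paper.
    """
    # Imported lazily so the classifier helpers below run without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)
    # Average token vectors, ignoring padding positions via the attention mask.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return pooled.tolist()


def fit_centroids(embeddings, labels):
    """Average the embeddings of each label into one centroid per class."""
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)
    return {label: [sum(column) / len(vectors) for column in zip(*vectors)]
            for label, vectors in groups.items()}


def predict(centroids, embedding):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], embedding))
```

In use, one would call embed_texts on the training texts, pass the result and the native-language labels to fit_centroids, and then call predict on the embedding of each unseen text. Raising or lowering max_length is one way to probe the input length dependency the abstract mentions.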
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
