Native Language Identification with Big Bird Embeddings

Sergey Kramp; Giovanni Cassani; Chris Emmery

arXiv:2309.06923·cs.CL·September 14, 2023

TL;DR

Classifiers trained on Big Bird transformer embeddings outperform linguistic feature engineering models at native language identification by a large margin on the Reddit-L2 dataset. Accuracy improves with longer input texts, and the approach is computationally efficient.

Contribution

The paper applies Big Bird embeddings to NLI and shows that the resulting classifiers outperform prior linguistic feature-based models, with accuracy that scales with input length.

Findings

01. Big Bird embeddings outperform linguistic feature models on the Reddit-L2 dataset.
02. Input length positively correlates with classification accuracy.
03. The method maintains consistent out-of-sample performance.

Abstract

Native Language Identification (NLI) aims to classify an author's native language based on their writing in another language. Historically, the task has relied heavily on time-consuming linguistic feature engineering, and transformer-based NLI models have thus far failed to offer effective, practical alternatives. The current work investigates whether input size is a limiting factor, and shows that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
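The embed-then-classify idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mean pooling, the nearest-centroid classifier, the checkpoint name `google/bigbird-roberta-base`, the `max_length` of 2048, and the labels `"de"` and `"fr"` are all choices made here for illustration; the listing above does not specify them. The optional `bigbird_embed` helper needs `torch` and `transformers` and downloads weights on first use, while the rest uses only the standard library.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


def fit_centroids(embeddings: Sequence[Vector], labels: Sequence[str]) -> Dict[str, Vector]:
    """Compute one mean embedding per native-language label."""
    grouped = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        grouped[label].append(emb)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(embedding: Vector, centroids: Dict[str, Vector]) -> str:
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))


def bigbird_embed(texts: Sequence[str],
                  model_name: str = "google/bigbird-roberta-base",  # assumed checkpoint
                  max_length: int = 2048) -> List[Vector]:           # assumed length cap
    """Optional helper: one mean-pooled Big Bird embedding per text."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    embeddings = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, truncation=True, max_length=max_length,
                            return_tensors="pt")
            hidden = model(**enc).last_hidden_state[0]  # shape: (tokens, hidden)
            embeddings.append(
                mean_pool(hidden.tolist(), enc["attention_mask"][0].tolist())
            )
    return embeddings
```

With real data, `bigbird_embed` would supply the vectors passed to `fit_centroids` and `predict`. Any classifier that accepts fixed-size vectors could replace the nearest-centroid step; it is used here only to keep the sketch dependency-free.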

Code & Models

Repositories

sergeykramp/mthesis-bigbird-embeddings
PyTorch · Official

Taxonomy

Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques