# Native language identification from text using a fine-tuned GPT-2 model

**Authors:** Yuzhe Nie

PMC · DOI: 10.7717/peerj-cs.2909 · PeerJ Computer Science · 2025-05-28

## TL;DR

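This paper shows that a fine-tuned GPT-2 model can identify learners' native languages from texts they wrote in Portuguese, outperforming traditional machine learning methods (e.g., SVM, Random Forest) and other pre-trained language models (e.g., BERT, RoBERTa, BioBERT) on the NLI-PT dataset.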

## Contribution

The contribution is a fine-tuned GPT-2 classifier for native language identification that outperforms the traditional and pre-trained baselines it was compared against on the NLI-PT dataset.

## Key findings

- On the NLI-PT dataset, the fine-tuned GPT-2 model achieved a weighted F1 score of 0.9419 and an accuracy of 94.65%.
- GPT-2 outperformed traditional machine learning models (e.g., SVM, Random Forest) and other pre-trained language models (e.g., BERT, RoBERTa, BioBERT).
- The results suggest transformer-based models are effective for native language identification tasks.
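
The two headline metrics above measure different things, which is why the weighted F1 (0.9419) can sit slightly below the accuracy (94.65%). As a minimal sketch, assuming the standard support-weighted definition (the same averaging that scikit-learn's `f1_score(average="weighted")` uses, though the paper's exact evaluation code is not shown in this summary), the computation looks like this. The function name and the toy labels are hypothetical and are not taken from NLI-PT.

```python
from collections import Counter


def weighted_f1_and_accuracy(y_true, y_pred):
    """Return (weighted F1, accuracy) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)    # true examples per class
    predicted = Counter(y_pred)  # predictions per class
    # True positives per class: positions where prediction matches the label.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    weighted_f1 = 0.0
    for label, n in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes its F1 weighted by its share of true examples.
        weighted_f1 += f1 * n / len(y_true)

    accuracy = sum(correct.values()) / len(y_true)
    return weighted_f1, accuracy


# Toy, made-up labels for three hypothetical native-language classes.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "B", "B"]
f1, acc = weighted_f1_and_accuracy(y_true, y_pred)
print(f"weighted F1 = {f1:.4f}, accuracy = {acc:.4f}")
```

In this toy case the single example of class C is misclassified, so that class scores an F1 of zero and the precision of class B drops as well. Accuracy only counts the one wrong prediction, while weighted F1 penalizes both affected classes, so it ends up lower.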

## Abstract

Native language identification (NLI) is a critical task in computational linguistics, supporting applications such as personalized language learning, forensic analysis, and machine translation. This study investigates the use of a fine-tuned GPT-2 model to enhance NLI accuracy. Using the NLI-PT dataset, we preprocess and fine-tune GPT-2 to classify the native language of learners based on their Portuguese-written texts. Our approach leverages deep learning techniques, including tokenization, embedding extraction, and multi-layer transformer-based classification. Experimental results show that our fine-tuned GPT-2 model significantly outperforms traditional machine learning methods (e.g., SVM, Random Forest) and other pre-trained language models (e.g., BERT, RoBERTa, BioBERT), achieving a weighted F1 score of 0.9419 and an accuracy of 94.65%. These results show that large transformer models work well for native language identification and can help guide future research in personalized language tools and artificial intelligence (AI)-based education.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12192634/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12192634/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/PMC12192634/full.md

---
Source: https://tomesphere.com/paper/PMC12192634