Leveraging Open-Source Large Language Models for Native Language Identification

Yee Man Ng; Ilia Markov

arXiv:2409.09659 · cs.CL · January 22, 2025

Open Access

TL;DR

This paper investigates the use of open-source large language models for native language identification, showing that fine-tuning enables them to match the performance of proprietary models, despite lower out-of-the-box accuracy.

Contribution

The paper demonstrates that open-source LLMs, once fine-tuned, can perform comparably to closed-source models on NLI, which addresses the cost and transparency concerns of proprietary systems.

Findings

01. Open-source LLMs underperform out-of-the-box compared to closed-source models.

02. Fine-tuning improves open-source LLM performance to match commercial models.

03. Open-source models offer a viable, transparent alternative for NLI after fine-tuning.

Abstract

Native Language Identification (NLI) - the task of identifying the native language (L1) of a person based on their writing in the second language (L2) - has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that heavily rely on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and undisclosed nature of training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used…
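To make the zero-shot, closed-set setup mentioned in the abstract more concrete, here is a minimal Python sketch of how such an evaluation could be framed: build a prompt listing the candidate L1s, map the model's free-text answer back to a label, and score accuracy. It is not the authors' implementation. The label list, the prompt wording, and the helper names (`build_prompt`, `parse_label`, `accuracy`) are all illustrative assumptions, and the call to an actual LLM is left out on purpose.

```python
from typing import List, Optional, Sequence

# Hypothetical closed label set, chosen only for illustration;
# the paper's actual label inventory is not given in this summary.
L1_LABELS: List[str] = ["Arabic", "Chinese", "French", "German", "Spanish"]


def build_prompt(text: str, labels: Sequence[str] = L1_LABELS) -> str:
    """Build a zero-shot, closed-set NLI prompt (wording is an assumption)."""
    options = ", ".join(labels)
    return (
        "You will read an English text written by a non-native speaker.\n"
        f"Choose the author's native language from: {options}.\n"
        "Answer with the language name only.\n\n"
        f"Text: {text}\n"
        "Native language:"
    )


def parse_label(response: str, labels: Sequence[str] = L1_LABELS) -> Optional[str]:
    """Map a free-text model response to one label.

    Returns None when no label, or more than one label, is mentioned,
    so ambiguous answers are counted as errors rather than guessed.
    """
    lowered = response.lower()
    hits = [label for label in labels if label.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In use, the string returned by `build_prompt` would be sent to whichever model is being evaluated, and the raw reply passed through `parse_label` before scoring. Treating unparseable replies as wrong is one possible design choice; an open-set variant would instead drop the fixed option list from the prompt.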

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout