Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection

Feng Xiong; Thanet Markchom; Ziwei Zheng; Subin Jung; Varun Ojha; Huizhi Liang

arXiv:2401.12326·cs.CL·January 24, 2024·2 cites

TL;DR

This paper explores fine-tuning large language models to detect machine-generated texts across multiple languages and domains, demonstrating that transformer models outperform traditional machine learning methods in this task.

Contribution

It introduces effective fine-tuning approaches for LLMs to improve detection of machine-generated texts in multilingual and multi-domain settings.

Findings

01

Transformer models, especially LoRA-RoBERTa, outperform traditional ML methods.

02

Majority voting enhances detection accuracy in multilingual scenarios.

03

Fine-tuned LLMs achieve higher effectiveness in identifying machine-generated texts.

Abstract

SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse Large Language Models (LLMs) in various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. To tackle this task, two methods are used: 1) traditional machine learning (ML) with natural language preprocessing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, are more effective than traditional ML methods, and that majority voting is particularly effective in multilingual contexts for identifying machine-generated texts.
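
The abstract does not say how the majority vote is computed, so the following is only a minimal sketch of one common form: hard voting over the per-text labels predicted by several classifiers. The function name `majority_vote`, the label convention (0 = human-written, 1 = machine-generated), and the tie-breaking rule are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by hard majority voting.

    predictions[m][i] is the label that model m assigned to text i.
    Ties are broken in favour of the smallest label (an assumption;
    the paper's tie-breaking rule is not stated in the abstract).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_texts = len(predictions[0])
    if any(len(p) != n_texts for p in predictions):
        raise ValueError("all models must predict the same number of texts")

    voted = []
    # zip(*predictions) groups the votes of all models for each text.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical predictions from three classifiers on four texts
# (0 = human-written, 1 = machine-generated; labels are assumed).
model_a = [0, 1, 1, 0]
model_b = [0, 1, 0, 0]
model_c = [1, 1, 0, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

With an odd number of binary classifiers, as in this example, ties cannot occur, so the tie-breaking rule only matters for an even number of models or for the multi-class setting of Subtask B.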

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling