MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba

arXiv:2406.12549·cs.CL·July 28, 2025·1 cites

TL;DR

This paper introduces MultiSocial, a comprehensive multilingual benchmark dataset for evaluating machine-generated text detection on social-media texts across 22 languages and 5 platforms, addressing a significant gap in current research.

Contribution

It provides the first large-scale multilingual and multi-platform dataset for social-media text detection, enabling evaluation of existing methods in zero-shot and fine-tuned settings.

Findings

01. Fine-tuned detectors perform well on social-media texts.
02. Platform selection influences detection performance.
03. Existing detection methods can be effectively trained on social-media data.
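
One way to check a claim like the platform effect above is to break a detector's scores down by platform (or by language) and compare a ranking metric such as ROC AUC per group. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields `platform`, `label`, and `score`, the helper names, and the toy records are all hypothetical, and the placeholder platform names "A" and "B" do not correspond to any platform in the dataset.

```python
from collections import defaultdict


def auc(labels, scores):
    """ROC AUC computed by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(records, key):
    """Group records by `key` (e.g. platform or language) and score each group."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = groups[r[key]]
        labels.append(r["label"])   # 1 = machine-generated, 0 = human-written
        scores.append(r["score"])   # detector's machine-generated score
    return {g: auc(l, s) for g, (l, s) in groups.items()}


# Toy, made-up records purely for illustration (not data from the paper).
records = [
    {"platform": "A", "label": 1, "score": 0.9},
    {"platform": "A", "label": 0, "score": 0.2},
    {"platform": "A", "label": 1, "score": 0.6},
    {"platform": "A", "label": 0, "score": 0.7},
    {"platform": "B", "label": 1, "score": 0.8},
    {"platform": "B", "label": 0, "score": 0.1},
]

print(auc_by_group(records, "platform"))
```

The pairwise AUC is quadratic in group size, which is fine for a sketch; for a dataset of this scale a sort-based implementation from an established library would be the more practical choice. Passing a different `key` (for example a hypothetical `language` field) gives the per-language breakdown with the same function.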

Abstract

Recent LLMs can generate high-quality multilingual texts that humans cannot distinguish from authentic human-written ones. Research in machine-generated text detection, however, focuses mostly on the English language and on longer texts, such as news articles, scientific papers, or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods to detect such texts, reflected also in the lack of multilingual benchmark datasets. To fill this gap, we propose the first multilingual (22 languages) and multi-platform (5 social-media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are…

Taxonomy

Topics: Authorship Attribution and Profiling