GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge
Shammur Absar Chowdhury, Hind Almerekhi, Mucahid Kutlu, Kaan Efe Keles, Fatema Ahmad, Tasnim Mohiuddin, George Mikros, Firoj Alam

TL;DR
This paper reviews the first Academic Essay Authenticity Challenge, a shared task on detecting machine-generated vs. human-authored academic essays in English and Arabic. Most submissions relied on fine-tuned transformer models, and one team used LLMs.
Contribution
It introduces a new benchmark dataset and evaluation framework for AI-generated essay detection, showcasing state-of-the-art results and diverse approaches from multiple teams.
Findings
Top systems achieved F1 scores over 0.98
Transformer-based models significantly outperformed baselines
Both English and Arabic detection tasks showed high accuracy
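The F1 figure above is the harmonic mean of precision and recall. As a minimal sketch of how such a score could be computed for this binary task, assuming hypothetical label strings `"machine"` and `"human"` and treating `"machine"` as the positive class (this excerpt does not state the challenge's official averaging scheme or label names):

```python
def f1_score(gold, pred, positive="machine"):
    """Binary F1 for one positive class; label names are illustrative assumptions."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the challenge dataset.
gold = ["machine", "human", "machine", "human", "machine"]
pred = ["machine", "human", "machine", "machine", "machine"]
score = f1_score(gold, pred)  # precision 3/4, recall 3/3
```

In this toy case one human essay is flagged as machine-generated, so precision drops to 0.75 while recall stays at 1.0, which gives an F1 of 6/7.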
Abstract
This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset…
Taxonomy
Topics: Topic Modeling
Methods: LLaMA
