A Comprehensive Dataset for Human vs. AI Generated Text Detection

Rajarshi Roy; Nasrin Imanpour; Ashhar Aziz; Shashwat Bajpai; Gurpreet Singh; Shwetangshu Biswas; Kapil Wanaskar; Parth Patwa; Subhankar Ghosh; Shreyas Dixit; Nilesh Ranjan Pal; Vipula Rawte; Ritvik Garimella; Gaytri Jena; Amit Sheth; Vasu Sharma; Aishwarya Naresh Reganti; Vinija Jain; Aman Chadha; Amitava Das

arXiv:2510.22874·cs.CL·March 3, 2026

TL;DR

This paper introduces a large, diverse dataset of over 58,000 texts combining authentic news articles and AI-generated versions from multiple models, aiming to improve detection and attribution of AI-generated text.

Contribution

The paper contributes a comprehensive, well-annotated dataset that pairs real-world journalism with AI-generated counterparts, to support research on detection and model attribution.

Findings

1. The baseline reaches 58.35% accuracy at distinguishing human-written from AI-generated text.

2. The baseline reaches 8.92% accuracy at attributing AI-generated text to the specific model that produced it.

3. The dataset pairs real news articles with outputs from multiple LLMs, so detectors can be evaluated across several generators.
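
Both reported figures are plain classification accuracies: the share of samples whose predicted label matches the true label. As a minimal sketch of how such numbers are computed, the snippet below scores a binary detection task and a multi-class attribution task, and also reports a majority-class reference point. The function names, label strings, and toy predictions are hypothetical illustrations, not the paper's evaluation code or data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent true label."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


# Hypothetical toy labels (assumption: not taken from the dataset).
detection_true = ["human", "ai", "ai", "human"]
detection_pred = ["human", "ai", "human", "human"]

attribution_true = ["gemma", "mistral", "qwen", "llama"]
attribution_pred = ["gemma", "qwen", "qwen", "yi"]

detection_acc = accuracy(detection_true, detection_pred)
attribution_acc = accuracy(attribution_true, attribution_pred)
detection_floor = majority_baseline(detection_true)

print(f"detection accuracy:   {detection_acc:.2%}")
print(f"attribution accuracy: {attribution_acc:.2%}")
print(f"majority baseline:    {detection_floor:.2%}")
```

Comparing a reported accuracy against the majority-class baseline on the same split shows how far a detector sits from a trivial predictor; that comparison depends on the class balance of the split.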

Abstract

The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs, including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides the original article abstracts, which serve as prompts, alongside the full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their…
