Distinguishing Chatbot from Human
Gauri Anil Godghase, Rishit Agrawal, Tanush Obili, Mark Stamp

TL;DR
This paper introduces a large dataset and machine learning methods for distinguishing human-written text from chatbot-generated text, and reports high classification accuracy on that task.
Contribution
The paper presents a new extensive dataset and compares feature analysis and embedding-based ML techniques for text origin classification, advancing detection of AI-generated content.
Findings
High classification accuracy achieved
Effective feature and embedding-based models developed
Enhanced understanding of chatbot text characteristics
Abstract
There have been many recent advances in the fields of generative Artificial Intelligence (AI) and Large Language Models (LLM), with the Generative Pre-trained Transformer (GPT) model being a leading "chatbot." LLM-based chatbots have become so powerful that it may seem difficult to differentiate between human-written and machine-generated text. To analyze this problem, we have developed a new dataset consisting of more than 750,000 human-written paragraphs, with a corresponding chatbot-generated paragraph for each. Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text (human or chatbot). Specifically, we consider two methodologies for tackling this issue: feature analysis and embeddings. Our feature analysis approach involves extracting a collection of features from the text for classification. We also explore the use of contextual embeddings…
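To make the feature analysis idea concrete, here is a minimal sketch of extracting a few numeric features from a paragraph so that a classifier could be trained on them. The specific features (word count, average word length, average sentence length, type-token ratio, punctuation ratio) and the function name `extract_features` are illustrative assumptions; the abstract does not list the features the authors actually use.

```python
import re
import string


def extract_features(text: str) -> dict:
    """Compute a few simple stylometric features.

    These features are hypothetical examples chosen for illustration,
    not the feature set used in the paper.
    """
    words = re.findall(r"[A-Za-z']+", text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_chars = len(text)
    return {
        "word_count": float(n_words),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Distinct lowercase words divided by total words.
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars if n_chars else 0.0,
    }


sample = "The cat sat. The cat ran!"
features = extract_features(sample)
```

Each paragraph in a labeled dataset would be mapped to such a feature vector, and the vectors, together with their human or chatbot labels, would then be passed to a standard classifier.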
Taxonomy
Topics: AI in Service Interactions · Sentiment Analysis and Opinion Mining
Methods: Linear Layer · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
