Spotlights and Blindspots: Evaluating Machine-Generated Text Detection
Kevin Stowe, Kailash Patil

TL;DR
This paper evaluates 15 machine-generated text detection models across diverse datasets and metrics, revealing significant variability in performance and emphasizing the importance of dataset and metric choices.
Contribution
It provides an empirical analysis of detection models, highlighting the impact of datasets, metrics, and methodological choices on performance evaluation.
Findings
No single detection system outperforms others across all tasks.
Model performance varies greatly depending on datasets and evaluation metrics.
Detection models perform poorly on novel human-written texts in high-risk domains.
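The dependence of rankings on the chosen metric can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two detectors (`detector_a`, `detector_b`), their scores, and the 0.5 decision threshold are invented toy values, not models or results from the paper. The sketch compares a threshold-free metric (AUROC) with a thresholded one (accuracy) on the same six texts.

```python
from typing import Dict, List

def auroc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (machine, human) pairs where the machine text scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Accuracy when every score at or above the threshold is called machine-generated."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def rank_detectors(metric_values: Dict[str, float]) -> List[str]:
    """Detector names ordered from best to worst under one metric."""
    return sorted(metric_values, key=lambda name: metric_values[name], reverse=True)

# Toy data (hypothetical): 0 = human-written, 1 = machine-generated.
labels = [0, 0, 0, 1, 1, 1]
toy_scores = {
    # Separates the classes perfectly, but every score sits below 0.5.
    "detector_a": [0.10, 0.15, 0.20, 0.30, 0.35, 0.40],
    # Calibrated around 0.5, but one human text scores above a machine text.
    "detector_b": [0.20, 0.40, 0.70, 0.60, 0.80, 0.90],
}

by_auroc = rank_detectors({name: auroc(labels, s) for name, s in toy_scores.items()})
by_accuracy = rank_detectors({name: accuracy_at(labels, s) for name, s in toy_scores.items()})

print("Ranked by AUROC:   ", by_auroc)
print("Ranked by accuracy:", by_accuracy)
```

In this toy setup the two metrics order the detectors differently: the first detector separates the classes but is poorly calibrated for a fixed threshold, while the second is the reverse. This is only one way a ranking can flip; the paper's own evaluation covers more metrics and datasets.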
Abstract
With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure comparisons of model effectiveness. To address this, we evaluate 15 different detection models from six distinct systems, as well as seven trained models, across seven English-language textual test sets and three creative human-written datasets. We provide an empirical analysis of model performance, the influence of training and evaluation data, and the impact of key metrics. We find that no single system excels in all areas and nearly all are effective for certain tasks, and the representation of model performance is critically linked to dataset and metric choices. We find high variance in model ranks based on datasets and metrics, and overall poor…
