Evaluating AI and Human Authorship Quality in Academic Writing through Physics Essays
Will Yeadon, Elise Agra, Oto-obong Inyang, Paul Mackay, Arin Mizouri

TL;DR
This study compares AI-generated and human physics essays, finding no significant quality difference and highlighting the challenge of distinguishing AI content through human judgment, while evaluating AI detection tools.
Contribution
It provides empirical evidence that AI-generated and human-written physics essays receive statistically indistinguishable scores, and it assesses the effectiveness of AI authorship detection tools.
Findings
No significant score difference between AI and human essays
AI detection tools vary in accuracy, with ZeroGPT reaching 98%
Humans struggle to reliably identify AI-generated essays
Abstract
This study evaluates short-form physics essay submissions, equally divided between student work submitted before the introduction of ChatGPT and those generated by OpenAI's GPT-4. In blinded evaluations conducted by five independent markers who were unaware of the origin of the essays, we observed no statistically significant differences in scores between essays authored by humans and those produced by AI (at a significance level of 0.05). Additionally, when the markers subsequently attempted to identify the authorship of the essays on a 4-point Likert scale - from 'Definitely AI' to 'Definitely Human' - their performance was only marginally better than random chance. This outcome not only underscores the convergence of AI and human authorship quality but also highlights the difficulty of discerning AI-generated content solely through human judgment. Furthermore, the…
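The score comparison described in the abstract can be illustrated with a small significance test. The sketch below is an assumption for illustration only: this summary does not say which statistical test the authors used, so a two-sided permutation test on the difference in mean scores stands in for it. The function name `permutation_test`, its parameters, and the marks are all hypothetical and are not data from the study.

```python
import random
from statistics import mean


def permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in mean scores.

    Hypothetical helper: the test used in the paper is not stated in
    this summary, so this is a stand-in, not the authors' method.
    """
    rng = random.Random(seed)
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up marks for illustration only; not data from the study.
human_scores = [62, 58, 71, 65, 60, 68, 55, 64]
ai_scores = [63, 59, 70, 66, 61, 67, 56, 65]

ALPHA = 0.05  # significance level quoted in the abstract
p_value = permutation_test(human_scores, ai_scores)
significant = p_value < ALPHA
print(f"p = {p_value:.3f}, significant at {ALPHA}: {significant}")
```

A p-value above the 0.05 threshold means the data give no grounds to reject the hypothesis that both groups score the same on average, which is the sense in which the abstract reports no statistically significant difference.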
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education
