HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis
Nafis Irtiza Tripto, Adaku Uchendu, Thai Le, Mattia Setzu, Fosca Giannotti, Dongwon Lee

TL;DR
HANSEN is a comprehensive benchmark dataset for spoken texts, enabling evaluation of authorship analysis and AI-generated speech detection, highlighting current limitations and future research directions in spoken language authorship attribution.
Contribution
Introduces the largest spoken text benchmark, HANSEN, with datasets from humans and AI, facilitating research in spoken authorship analysis and AI speech detection.
Findings
SOTA methods perform similarly on spoken and written texts for authorship attribution.
Current models struggle to detect AI-generated spoken texts effectively.
HANSEN provides a valuable resource for advancing spoken language authorship research.
Abstract
Authorship Analysis, also known as stylometry, has been an essential aspect of Natural Language Processing (NLP) for a long time. Likewise, the recent advancement of Large Language Models (LLMs) has made authorship analysis increasingly crucial for distinguishing between human-written and AI-generated texts. However, these authorship analysis tasks have primarily been focused on written texts, not considering spoken texts. Thus, we introduce the largest benchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark). HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets. Together, it comprises 17 human datasets, and AI-generated spoken texts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate and demonstrate the utility of HANSEN, we perform…
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling
