HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis

Nafis Irtiza Tripto; Adaku Uchendu; Thai Le; Mattia Setzu; Fosca Giannotti; Dongwon Lee

arXiv:2310.16746 · cs.CL · October 26, 2023 · 2 cites

TL;DR

HANSEN is a benchmark of human and AI-generated spoken texts for evaluating authorship analysis and AI-generated spoken text detection. It highlights current limitations and future research directions in spoken language authorship attribution.

Contribution

Introduces HANSEN, the largest benchmark for spoken texts, combining human and AI-generated datasets to support research on spoken authorship analysis and AI-generated spoken text detection.

Findings

01. SOTA methods perform similarly on spoken and written texts for authorship attribution.

02. Current models struggle to detect AI-generated spoken texts effectively.

03. HANSEN provides a valuable resource for advancing spoken language authorship research.

Abstract

Authorship Analysis, also known as stylometry, has long been an essential aspect of Natural Language Processing (NLP). Likewise, the recent advancement of Large Language Models (LLMs) has made authorship analysis increasingly crucial for distinguishing between human-written and AI-generated texts. However, these authorship analysis tasks have primarily focused on written texts, not considering spoken texts. Thus, we introduce the largest benchmark for spoken texts: HANSEN (Human ANd ai Spoken tExt beNchmark). HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets. Together, it comprises 17 human datasets and AI-generated spoken texts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate and demonstrate the utility of HANSEN, we perform…

Code & Models

Datasets

HANSEN-REPO/HANSEN
dataset · 17 dl

Taxonomy

Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling