Supporting Human Raters with the Detection of Harmful Content using Large Language Models

Kurt Thomas; Patrick Gage Kelley; David Tao; Sarah Meiklejohn; Owen Vallis; Shunwen Tan; Blaž Bratanič; Felipe Tiengo Ferreira; Vijay Kumar Eranti; Elie Bursztein

arXiv:2406.12800·cs.CR·June 19, 2024·1 cites

Open Access

TL;DR

This paper investigates how large language models can assist human raters in identifying harmful online content, achieving high accuracy and improving efficiency and precision in real-world moderation tasks.

Contribution

It introduces five design patterns for integrating LLMs with human moderation, supported by a unified prompt, and demonstrates significant improvements in capacity and detection metrics.

Findings

01. LLMs achieved 90% accuracy when compared to human verdicts on a dataset of 50,000 comments.

02. A real-world pilot increased human rater capacity by 41.5%.

03. Detection precision and recall improved by 9-11%.

Abstract

In this paper, we explore the feasibility of leveraging large language models (LLMs) to automate or otherwise assist human raters with identifying harmful content including hate speech, harassment, violent extremism, and election misinformation. Using a dataset of 50,000 comments, we demonstrate that LLMs can achieve 90% accuracy when compared to human verdicts. We explore how to best leverage these capabilities, proposing five design patterns that integrate LLMs with human rating, such as pre-filtering non-violative content, detecting potential errors in human rating, or surfacing critical context to support human rating. We outline how to support all of these design patterns using a single, optimized prompt. Beyond these synthetic experiments, we share how piloting our proposed techniques in a real-world review queue yielded a 41.5% improvement in optimizing available human rater…
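As a rough illustration of two of the design patterns named in the abstract (pre-filtering non-violative content and detecting potential errors in human rating), the sketch below routes comments based on an LLM score. It is not the authors' implementation: the `Item` type, the `route` function, the `llm_score` field, and both thresholds are hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    comment_id: str
    # Hypothetical: LLM-estimated probability that the comment is violative, in [0, 1].
    llm_score: float
    # True = rater marked it violative, False = benign, None = not yet rated.
    human_verdict: Optional[bool] = None


def route(item: Item, clear_below: float = 0.05, flag_above: float = 0.9) -> str:
    """Decide what happens to a comment. Thresholds are illustrative assumptions."""
    if item.human_verdict is None:
        # Pre-filtering: confidently benign comments skip the human queue.
        if item.llm_score < clear_below:
            return "auto_clear"
        return "human_review"

    # Error detection: flag strong disagreement between the LLM and the rater.
    llm_says_violative = item.llm_score >= flag_above
    llm_says_benign = item.llm_score < clear_below
    if (item.human_verdict and llm_says_benign) or (
        not item.human_verdict and llm_says_violative
    ):
        return "re_review"
    return "accept"


if __name__ == "__main__":
    queue = [
        Item("a", 0.01),
        Item("b", 0.50),
        Item("c", 0.95, human_verdict=False),
    ]
    for item in queue:
        print(item.comment_id, route(item))
```

In practice the thresholds would need to be tuned against labeled data, since a lower `clear_below` trades rater capacity for fewer missed violations.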

Taxonomy

Topics: Information and Cyber Security · Deception detection and forensic psychology · Software Engineering Research