Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward

Xuan Xie; Jiayang Song; Zhehua Zhou; Yuheng Huang; Da Song; Lei Ma

arXiv:2404.08517·cs.SE·April 15, 2024·1 cites

Open Access

TL;DR

This paper introduces a comprehensive benchmark and evaluation of online safety analysis methods for LLMs, addressing the gap in real-time safety detection during text generation and proposing hybrid approaches to improve reliability.

Contribution

It establishes the first public benchmark for online safety analysis of LLMs, evaluates existing methods, and explores hybrid techniques to enhance safety detection during generation.

Findings

1. Existing methods have varying strengths and weaknesses.
2. Hybrid approaches can improve safety detection accuracy.
3. The benchmark provides a standardized platform for future research.
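
To make the idea of online (during-generation) safety analysis and of hybrid detection concrete, here is a minimal sketch. It is not the paper's method or benchmark code: the two detectors, the `max` combination rule, the threshold, and every function name are hypothetical placeholders chosen only for illustration. The sketch scores the partial output after each new token and stops as soon as the combined score crosses the threshold.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical detector type: maps the partial output text to a risk score in [0, 1].
Detector = Callable[[str], float]


def keyword_detector(partial: str) -> float:
    """Toy stand-in for a detector-based method: flags a fixed placeholder word list."""
    flagged = {"attack", "exploit"}  # illustrative words only, not from the paper
    words = [w.strip(".,!?") for w in partial.lower().split()]
    return 1.0 if any(w in flagged for w in words) else 0.0


def repetition_detector(partial: str) -> float:
    """Toy stand-in for a second signal: fraction of repeated words so far."""
    words = partial.lower().split()
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def hybrid_score(partial: str, detectors: List[Detector]) -> float:
    """Assumed combination rule: take the most pessimistic (maximum) score."""
    return max(d(partial) for d in detectors)


def monitor_stream(
    tokens: Iterable[str],
    detectors: List[Detector],
    threshold: float = 0.5,
) -> Tuple[str, Optional[int]]:
    """Check the partial output after every token.

    Returns the text emitted so far and the index of the token at which
    generation was stopped, or None if the stream finished without a flag.
    """
    emitted: List[str] = []
    for i, tok in enumerate(tokens):
        emitted.append(tok)
        partial = " ".join(emitted)
        if hybrid_score(partial, detectors) >= threshold:
            return partial, i  # stop early, before the full output exists
    return " ".join(emitted), None


if __name__ == "__main__":
    detectors = [keyword_detector, repetition_detector]
    print(monitor_stream(["how", "to", "exploit", "a", "server"], detectors))
    print(monitor_stream(["the", "weather", "is", "nice"], detectors))
```

In a real setting the token list would be replaced by a model's streaming output and the toy scorers by actual safety analysis methods; the point of the sketch is only the control flow, where a decision is made on a prefix rather than on the completed text.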

Abstract

While Large Language Models (LLMs) have seen widespread application across numerous fields, their limited interpretability raises concerns about their safe operation in several respects, e.g., truthfulness, robustness, and fairness. Recent research has started developing quality assurance methods for LLMs, introducing techniques such as offline detector-based or uncertainty estimation methods. However, these approaches predominantly concentrate on post-generation analysis, leaving online safety analysis for LLMs during the generation phase an unexplored area. To bridge this gap, in this work we conduct a comprehensive evaluation of the effectiveness of existing online safety analysis methods on LLMs. We begin with a pilot study that validates the feasibility of detecting unsafe outputs early in the generation process. Following this, we establish the first publicly available…

Taxonomy

Topics: Software Engineering Research · Hate Speech and Cyberbullying Detection