CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Erxin Yu; Jing Li; Ming Liao; Siqi Wang; Zuchen Gao; Fei Mi; Lanqing Hong

arXiv:2406.17626 · cs.CL · June 26, 2024

TL;DR

This paper investigates the safety of large language models in multi-turn dialogue scenarios involving coreference, revealing vulnerabilities through a new dataset and evaluation of popular open-source models.

Contribution

The paper introduces what its authors describe as the first dataset and evaluation framework for multi-turn dialogue coreference safety attacks on LLMs, and uses them to expose gaps in the safety of existing models.

Findings

01. LLaMA2-Chat-7b had the highest attack success rate of the five models evaluated, at 56%.

02. Mistral-7B-Instruct had the lowest attack success rate, at 13.9%.

03. All five models showed safety vulnerabilities under multi-turn dialogue coreference attacks.

Abstract

As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
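The attack success rates quoted above are proportions of attack prompts that drew an unsafe response. The sketch below shows one way such a rate could be computed, overall and per category. It is a minimal illustration, not the authors' evaluation code: the `JudgedResponse` record, its `category` and `is_unsafe` fields, and both function names are hypothetical, and it assumes each model response has already been labeled safe or unsafe by some judge.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Hypothetical record: one model response to one attack dialogue."""
    category: str    # assumed label for one of the dataset's categories
    is_unsafe: bool  # assumed judge verdict: True means the attack succeeded


def attack_success_rate(responses):
    """Return the fraction of responses judged unsafe (0.0 if none given)."""
    if not responses:
        return 0.0
    return sum(r.is_unsafe for r in responses) / len(responses)


def per_category_asr(responses):
    """Group responses by category and compute the rate within each group."""
    by_category = {}
    for r in responses:
        by_category.setdefault(r.category, []).append(r)
    return {c: attack_success_rate(rs) for c, rs in by_category.items()}
```

Under these assumptions, a reported rate of 56% would correspond to 56 of every 100 judged responses being marked unsafe.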

Code & Models

Repositories

ErxinYu/CoSafe-Dataset (Official)

Videos

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference (Underline)

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multi-Agent Systems and Negotiation