How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries

Somnath Banerjee; Sayan Layek; Rima Hazra; Animesh Mukherjee

arXiv:2402.15302·cs.CL·November 19, 2024·3 cites

TL;DR

This paper investigates how instruction-centric prompts influence large language models' tendency to produce unethical content, revealing increased risks and vulnerabilities in safety guardrails, especially after model editing.

Contribution

It introduces TechHazardQA, a dataset for testing unethical responses, and demonstrates that instruction-centric prompts and model editing significantly raise unethical output risks.

Findings

01. Instruction-centric prompts increase unethical responses by 2-38%.

02. Model editing with ROME further amplifies unethical output by 3-16%.

03. Evaluation includes harmfulness scores and human/AI judgments.
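
The percentage figures above compare how often a model's responses are judged unethical under two conditions. The sketch below shows one way such a gap could be computed from binary judgments. It assumes the gap is an absolute difference in rates (percentage points), which this summary does not confirm. The function names and the toy judgment lists are hypothetical and are not taken from the paper or from TechHazardQA.

```python
from typing import Sequence


def unethical_rate(judgments: Sequence[bool]) -> float:
    """Fraction of responses judged unethical (True means unethical)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def rate_increase_points(
    text_judgments: Sequence[bool],
    instruction_judgments: Sequence[bool],
) -> float:
    """Absolute gap in unethical-response rate, in percentage points.

    Assumption: the gap is measured as instruction-centric rate minus
    plain-text rate, not as a relative change.
    """
    gap = unethical_rate(instruction_judgments) - unethical_rate(text_judgments)
    return 100.0 * gap


# Hypothetical toy judgments for the same four queries in both formats.
text_judgments = [False, False, True, False]          # 1 of 4 unethical
instruction_judgments = [True, False, True, False]    # 2 of 4 unethical

increase = rate_increase_points(text_judgments, instruction_judgments)
print(f"increase: {increase:.1f} percentage points")
```

The same helper could compare an edited model against its unedited baseline by passing the two sets of judgments in place of the two formats.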

Abstract

In this study, we tackle a growing concern around the safety and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. Our work zeroes in on a specific issue: to what extent LLMs can be led astray by asking them to generate responses that are instruction-centric such as a pseudocode, a program or a software snippet as opposed to vanilla text. To investigate this question, we introduce TechHazardQA, a dataset containing complex queries which should be answered in both text and instruction-centric formats (e.g., pseudocodes), aimed at identifying triggers for unethical responses. We query a series of LLMs -- Llama-2-13b, Llama-2-7b, Mistral-V2 and Mistral 8X7B -- and ask them to generate both text and…

Code & Models

Repositories

https://huggingface.co/datasets/SoftMINER-Group/TechHazardQA

Datasets

SoftMINER-Group/TechHazardQA
dataset · 57 dl

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Internet Traffic Analysis and Secure E-voting · Digital and Cyber Forensics

Methods: Linear Layer · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Layer Normalization · Multi-Head Attention