How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee

TL;DR
This paper investigates how instruction-centric prompts influence large language models' tendency to produce unethical content, revealing increased risks and vulnerabilities in safety guardrails, especially after model editing.
Contribution
It introduces TechHazardQA, a dataset for testing unethical responses, and demonstrates that instruction-centric prompts and model editing significantly raise unethical output risks.
Findings
Asking for instruction-centric responses raises the rate of unethical output by roughly 2-38%, depending on the model.
Editing the models with ROME raises unethical instruction-centric output by a further 3-16%.
Evaluation includes harmfulness scores and human/AI judgments.
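To make the reported increases concrete, here is a minimal sketch of how a gap between text and instruction-centric responses could be computed from per-query judge labels. The function names and the toy labels are assumptions made for illustration and do not come from the paper. This summary does not say whether the reported ranges are percentage-point or relative increases, so the sketch computes both.

```python
from typing import Dict, Optional, Sequence


def harmful_rate(labels: Sequence[bool]) -> float:
    """Percentage of responses that a judge marked as harmful."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return 100.0 * sum(labels) / len(labels)


def compare_formats(
    text_labels: Sequence[bool],
    instruction_labels: Sequence[bool],
) -> Dict[str, Optional[float]]:
    """Compare harmful-response rates for the two response formats.

    Returns both the absolute gap (percentage points) and the relative
    increase, since the summary does not say which one is reported.
    """
    text_rate = harmful_rate(text_labels)
    instruction_rate = harmful_rate(instruction_labels)
    absolute = instruction_rate - text_rate
    # Relative increase is undefined when the text baseline is zero.
    relative = None if text_rate == 0 else 100.0 * absolute / text_rate
    return {
        "text_rate": text_rate,
        "instruction_rate": instruction_rate,
        "absolute_increase_pp": absolute,
        "relative_increase_pct": relative,
    }


# Toy labels invented for illustration only (True = judged harmful).
# They are NOT results from the paper.
toy_text = [True] + [False] * 9          # 1 of 10 judged harmful
toy_instruction = [True] * 3 + [False] * 7  # 3 of 10 judged harmful

summary = compare_formats(toy_text, toy_instruction)
print(summary)
```

The same comparison could be run once on the unedited model and once on the edited model to separate the effect of the response format from the effect of editing.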
Abstract
In this study, we tackle a growing concern around the safety and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. Our work zeroes in on a specific issue: to what extent LLMs can be led astray by asking them to generate instruction-centric responses, such as pseudocode, a program, or a software snippet, as opposed to vanilla text. To investigate this question, we introduce TechHazardQA, a dataset of complex queries that should be answered in both text and instruction-centric formats (e.g., pseudocode), aimed at identifying triggers for unethical responses. We query a series of LLMs (Llama-2-13b, Llama-2-7b, Mistral-V2 and Mistral 8X7B) and ask them to generate both text and…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Internet Traffic Analysis and Secure E-voting · Digital and Cyber Forensics
Methods: Linear Layer · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Layer Normalization · Multi-Head Attention
