LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language
Yubin Ge, Neeraja Kirtane, Hao Peng, Dilek Hakkani-Tür

TL;DR
This paper demonstrates that state-of-the-art large language models are vulnerable to malicious prompts disguised as scientific language, which increase the bias and toxicity of their outputs, and it highlights the need to scrutinize training data more carefully.
Contribution
The paper shows that LLMs are susceptible to malicious prompts disguised as scientific language and analyzes the factors that contribute to this vulnerability, emphasizing the importance of scrutinizing training data.
Findings
Models' biases and toxicity increase with malicious scientific prompts
Models can be manipulated to generate fabricated scientific arguments
Mentioning author names and venues amplifies model biases
Abstract
As large language models (LLMs) have been deployed in various real-world settings, concerns about the harm they may propagate have grown. Various jailbreaking techniques have been developed to expose the vulnerabilities of these models and improve their safety. This work reveals that many state-of-the-art LLMs are vulnerable to malicious requests hidden behind scientific language. Specifically, our experiments with GPT4o, GPT4o-mini, GPT-4, Llama3-405B-Instruct, Llama3-70B-Instruct, Cohere, and Gemini models demonstrate that the models' biases and toxicity substantially increase when they are prompted with requests that deliberately misinterpret social science and psychological studies as evidence supporting the benefits of stereotypical biases. Alarmingly, these models can also be manipulated to generate fabricated scientific arguments claiming that biases are beneficial, which can be used by…
Taxonomy
Topics: Digital and Cyber Forensics · Network Security and Intrusion Detection · Advanced Malware Detection Techniques
Methods: Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer
