LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language

Yubin Ge, Neeraja Kirtane, Hao Peng, Dilek Hakkani-Tür

arXiv:2501.14073 · cs.CL · February 19, 2025

Open Access

TL;DR

This paper demonstrates that state-of-the-art large language models are vulnerable to malicious prompts disguised as scientific language, which increase the bias and toxicity of their outputs, and it highlights the need for careful scrutiny of training data.

Contribution

It reveals the susceptibility of LLMs to scientifically disguised malicious prompts and analyzes factors contributing to these vulnerabilities, emphasizing the importance of training data scrutiny.

Findings

1. Models' biases and toxicity increase with malicious scientific prompts
2. Models can be manipulated to generate fabricated scientific arguments
3. Mentioning author names and venues amplifies model biases

Abstract

As large language models (LLMs) have been deployed in various real-world settings, concerns about the harm they may propagate have grown. Various jailbreaking techniques have been developed to expose the vulnerabilities of these models and improve their safety. This work reveals that many state-of-the-art LLMs are vulnerable to malicious requests hidden behind scientific language. Specifically, our experiments with GPT4o, GPT4o-mini, GPT-4, Llama3-405B-Instruct, Llama3-70B-Instruct, Cohere, and Gemini models demonstrate that the models' biases and toxicity substantially increase when prompted with requests that deliberately misinterpret social science and psychological studies as evidence supporting the benefits of stereotypical biases. Alarmingly, these models can also be manipulated to generate fabricated scientific arguments claiming that biases are beneficial, which can be used by…

Taxonomy

Topics: Digital and Cyber Forensics · Network Security and Intrusion Detection · Advanced Malware Detection Techniques

Methods: Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer