Exploring the Adversarial Capabilities of Large Language Models

Lukas Struppek; Minh Hieu Le; Dominik Hintersdorf; Kristian Kersting

arXiv:2402.09132 · cs.AI · July 9, 2024 · 1 citation

TL;DR

This paper investigates the adversarial capabilities of large language models, revealing their ability to generate perturbations that can bypass safety measures like hate speech detection, raising concerns for autonomous systems.

Contribution

It demonstrates that publicly available LLMs can craft adversarial examples to undermine safety systems, a previously underexplored security risk.

Findings

01 LLMs can generate adversarial perturbations that fool hate speech detectors.

02 Adversarial attacks by LLMs pose challenges to safety and security in autonomous systems.

03 The study highlights the need for improved robustness in safety measures against LLM-based attacks.

Abstract

The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples so that they fool safety measures, producing so-called adversarial examples (also referred to as adversarial attacks). More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safeguards. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our…
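
The idea of perturbing a sample until a detector no longer flags it can be illustrated with a minimal Python sketch. This is not the authors' method or code: `toy_detector` is a hypothetical keyword-based stand-in for a real hate speech classifier, `toy_perturb` is a hypothetical character-substitution stand-in for an LLM proposing a rewrite, and the threshold, step limit, and placeholder token `badword` are assumptions chosen only for illustration.

```python
from typing import Callable, Tuple

# Hypothetical blocklist standing in for a learned hate speech detector.
BLOCKLIST = {"badword"}

# Hypothetical look-alike substitutions standing in for LLM-proposed edits.
SUBSTITUTIONS = [("a", "@"), ("o", "0"), ("e", "3")]


def toy_detector(text: str) -> float:
    """Return the fraction of tokens that appear on the blocklist."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    flagged = sum(1 for t in tokens if t.strip(".,!?") in BLOCKLIST)
    return flagged / len(tokens)


def toy_perturb(text: str, step: int) -> str:
    """Replace the first occurrence of one character, cycling by step."""
    old, new = SUBSTITUTIONS[step % len(SUBSTITUTIONS)]
    return text.replace(old, new, 1)


def craft_adversarial(
    text: str,
    detector: Callable[[str], float],
    perturb: Callable[[str, int], str],
    threshold: float = 0.1,
    max_steps: int = 10,
) -> Tuple[str, float]:
    """Greedily keep perturbations that lower the detector score."""
    current, score = text, detector(text)
    for step in range(max_steps):
        if score < threshold:
            break  # the detector no longer flags the sample
        candidate = perturb(current, step)
        candidate_score = detector(candidate)
        if candidate_score < score:
            current, score = candidate, candidate_score
    return current, score


original = "this is a badword example"
adversarial, final_score = craft_adversarial(original, toy_detector, toy_perturb)
print(adversarial, final_score)
```

In a setting like the one the abstract describes, the perturbation step would be a prompt to an LLM and the detector a trained classifier; the greedy accept-if-lower rule here is just one simple way to use the detector's score as feedback.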

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Anomaly Detection Techniques and Applications