Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning

Xiao Wang; Tianze Chen; Xianjun Yang; Qi Zhang; Xun Zhao; Dahua Lin

arXiv:2404.10552 · cs.CL · April 17, 2024 · 1 citation

Open Access

TL;DR

This paper shows that base large language models, despite their limited instruction-following ability and lack of alignment, can interpret and execute malicious instructions through in-context learning. This poses significant security risks, and the paper introduces new metrics for evaluating them.

Contribution

The study uncovers the misuse potential of base LLMs via in-context learning and develops novel metrics to assess associated risks.

Findings

01. Base LLMs can effectively interpret and execute malicious instructions when given carefully designed demonstrations.

02. The risk levels of base LLM outputs are comparable to those of maliciously fine-tuned models.

03. Misusing base LLMs requires no specialized knowledge or training.

Abstract

The open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. This includes both base models, which are pre-trained on extensive datasets without alignment, and aligned models, deliberately designed to align with ethical standards and human values. Contrary to the prevalent assumption that the inherent instruction-following limitations of base LLMs serve as a safeguard against misuse, our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions. To systematically assess these risks, we introduce a novel set of risk evaluation metrics. Empirical results reveal that the outputs from base LLMs can exhibit risk levels on par with those of models fine-tuned for malicious…

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection

Methods: Sparse Evolutionary Training · ALIGN · Balanced Selection