Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning
Xiao Wang, Tianze Chen, Xianjun Yang, Qi Zhang, Xun Zhao, Dahua Lin

TL;DR
This paper shows that base large language models, given carefully designed in-context demonstrations, can interpret and execute malicious instructions. Their limited instruction-following ability is therefore not a safeguard against misuse, and the resulting security risk is significant. The paper also introduces new risk evaluation metrics.
Contribution
The study uncovers the misuse potential of base LLMs via in-context learning and develops novel metrics to assess associated risks.
Findings
Base LLMs can effectively interpret and execute malicious instructions when given carefully designed demonstrations.
Outputs from base LLMs can exhibit risk levels comparable to those of maliciously fine-tuned models.
Misuse potential exists without specialized knowledge or training.
Abstract
The open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. This includes both base models, which are pre-trained on extensive datasets without alignment, and aligned models, deliberately designed to align with ethical standards and human values. Contrary to the prevalent assumption that the inherent instruction-following limitations of base LLMs serve as a safeguard against misuse, our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions. To systematically assess these risks, we introduce a novel set of risk evaluation metrics. Empirical results reveal that the outputs from base LLMs can exhibit risk levels on par with those of models fine-tuned for malicious…
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection
Methods: Sparse Evolutionary Training · ALIGN · Balanced Selection
