Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models
Cheng'an Wei, Yue Zhao, Yujia Gong, Kai Chen, Lu Xiang, and Shenchen, Zhu

TL;DR
This paper reveals how chat history tampering in LLMs like ChatGPT and Llama can manipulate model behavior, introduces a systematic method to inject fake history, and demonstrates significant impact on model outputs.
Contribution
It proposes a novel prompt template search method using LLM-guided genetic algorithms to enable chat history tampering in black-box LLMs.
Findings
Tampering can increase disallowed response success rate to 97% on ChatGPT.
Effective templates can be automatically generated and optimized.
Chat history tampering affects model behavior over time.
Abstract
Large Language Models (LLMs) such as ChatGPT and Llama have become prevalent in real-world applications, exhibiting impressive text generation performance. LLMs are fundamentally developed from a scenario where the input data remains static and unstructured. To behave interactively, LLM-based chat systems must integrate prior chat history as context into their inputs, following a pre-defined structure. However, LLMs cannot separate user inputs from context, enabling chat history tampering. This paper introduces a systematic methodology to inject user-supplied history into LLM conversations without any prior knowledge of the target model. The key is to utilize prompt templates that can well organize the messages to be injected, leading the target LLM to interpret them as genuine chat history. To automatically search for effective templates in a WebUI black-box setting, we propose the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Machine Learning in Healthcare
MethodsLLaMA
