Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
Alina Fastowski, Bardh Prenkaj, Yuxiao Li, Gjergji Kasneci

TL;DR
This paper introduces Xmera, a novel framework for evaluating adversarial man-in-the-middle attacks on LLMs' factual memory, revealing vulnerabilities and proposing a detection method based on response uncertainty.
Contribution
The paper presents the first principled attack evaluation framework for LLMs' factual recall under prompt injection, demonstrating high success rates and proposing a defense using response uncertainty classifiers.
Findings
Trivial instruction-based attacks achieve up to 85.3% success rate.
Response uncertainty can effectively distinguish attacked from unattacked queries with 94.8% AUC.
Signaling users about answer uncertainty enhances LLM safety.
Abstract
LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Adversarial Robustness in Machine Learning · Spam and Phishing Detection
