Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities
Lili Zhang, Haomiaomiao Wang, Long Cheng, Libao Deng, Tomas Ward

TL;DR
This paper presents an adversarial evaluation framework for LLMs that tests their decision-making robustness and strategic flexibility in interactive scenarios, revealing vulnerabilities and behavioral patterns.
Contribution
It introduces a novel methodology inspired by psychology and game theory to systematically diagnose decision-making weaknesses in LLMs under adversarial conditions.
Findings
Models show specific susceptibilities to manipulation.
Behavioral patterns vary significantly across models.
The framework highlights the importance of adaptability and fairness.
Abstract
As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioural vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of using adaptive strategy in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the Multi-Round Trust Task. These tasks capture key aspects of exploration-exploitation trade-offs, social…
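The two-armed bandit task named in the abstract can be illustrated with a minimal sketch. This is not the paper's setup: the reward probabilities, the trial count, and the epsilon-greedy agent (a simple stand-in for an LLM decision-maker) are all assumptions chosen for illustration, and no adversary is modelled here.

```python
import random


def run_two_armed_bandit(reward_probs=(0.3, 0.7), n_trials=200, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a two-armed Bernoulli bandit.

    All parameters are hypothetical illustration values, not taken from the paper.
    Returns the list of chosen arms and the total reward collected.
    """
    rng = random.Random(seed)
    counts = [0, 0]        # number of pulls per arm
    values = [0.0, 0.0]    # running mean reward per arm
    choices = []
    total_reward = 0

    for _ in range(n_trials):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(2)
        else:
            # Exploit: pick the arm with the higher estimated value (ties go to arm 0).
            arm = 0 if values[0] >= values[1] else 1

        # Bernoulli reward drawn from the chosen arm's probability.
        reward = 1 if rng.random() < reward_probs[arm] else 0

        # Incremental update of the running mean for the chosen arm.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

        choices.append(arm)
        total_reward += reward

    return choices, total_reward


if __name__ == "__main__":
    choices, total = run_two_armed_bandit()
    print("pulls of arm 1:", sum(choices), "total reward:", total)
```

In an adversarial variant, the fixed `reward_probs` would be replaced by a policy that sets rewards in response to the agent's choice history; how the paper implements that is not stated in the text above.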
Peer Reviews
Decision · Submitted to ICLR 2025
The proposed framework looks interesting. The problem of LLM decision making with adversarial testing is new.
Prompt Robustness: LLMs can be very sensitive to the prompt design. Is the observation a product of particular prompt design? The paper has only used a single prompt for both the tasks. A generic framework could be designed where prompts can be varied and new scenarios added to check the robustness of the results. Please create variations of the scenarios to make sure the observations are indeed generic. Please also give the temperature of the LLM used and if possible make more runs to report t
1. The problem setting is relevant and interesting. As LLMs scale and improve, it is a generally interesting question to understand their decision-making processes, particularly under adversarial settings. 2. The results, although in toy settings, are noteworthy. It is notable that it is possible to train a powerful adversary, and that an RNN model can replicate the decision-making processes of an LLM to a certain degree.
1. **Writing**: Writing is very poor and often redundant. The *abstract* and *conclusion* are longer than needed and do not effectively summarize the work. On the other hand, the *method* section does not build up motivation properly and dives straight into notation without a proper setup of the problem. I also think that a separate *related work* section should be added, and some content should be moved over from the introduction. The current flow of the *introduction* is not smooth. There are
1. The paper focuses on a good aspect when using LLM agents to solve online decision-making problems with adversarial environments. 2. The paper is overall well-written and easy to understand. 3. The behavioral analysis involving human beings is interesting.
1. The paper could be strengthened by including more recent and relevant studies, such as Krishnamurthy, Akshay, Keegan Harris, Dylan J. Foster, Cyril Zhang, and Aleksandrs Slivkins. "Can large language models explore in-context?" arXiv preprint arXiv:2403.15371 (2024). Although it is an arXiv article, to the best of my knowledge this is among the first few papers seriously discussing applying LLMs to bandit-related problems. 2. Consider incorporating and evaluating more advanced LLMs, such a
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
Methods: Attention Is All You Need · Label Smoothing · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Cosine Annealing · Attention Dropout · Residual Connection
