Too Big to Fool: Resisting Deception in Language Models
Mohammad Reza Samsami, Mats Leon Richter, Juan Rodriguez, Megh Thakkar, Sarath Chandar, Maxime Gasse

TL;DR
This paper explores how larger language models are more resistant to deception and better at integrating prompt information with their internal knowledge, improving their accuracy and robustness against misleading inputs.
Contribution
It demonstrates that larger models are more resilient to deceptive prompts, and that this resilience likely comes from leveraging implicit task-relevant information in the prompt rather than from memorization, advancing understanding of model robustness.
Findings
Larger models show higher resilience to deceptive prompts.
Larger models outperform smaller ones in following legitimate instructions.
Resilience is due to better use of implicit prompt information, not memorization.
Abstract
Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally…
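The evaluation idea described in the abstract can be illustrated with a minimal sketch: score a model on multiple-choice questions with and without a misleading in-context hint, then compare the two accuracies. Everything below is an assumption made for illustration and is not taken from the paper: the `MCQ` container, the hint wording, the rule for picking the wrong answer, the `relative_drop` metric, and the `model` callable (any function that maps a prompt string to an answer letter).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

LETTERS = "ABCDEFGH"


@dataclass
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: Sequence[str]  # assumed to hold at least two options
    answer: int             # index of the correct choice


def format_prompt(item: MCQ, hint: Optional[int] = None) -> str:
    """Render a question, optionally followed by an in-context hint."""
    lines = [item.question]
    for i, choice in enumerate(item.choices):
        lines.append(f"{LETTERS[i]}. {choice}")
    if hint is not None:
        # Illustrative hint wording; the paper's exact phrasing may differ.
        lines.append(f"Hint: the correct answer is {LETTERS[hint]}.")
    lines.append("Answer:")
    return "\n".join(lines)


def wrong_hint(item: MCQ) -> int:
    """Pick an incorrect option deterministically (the one after the answer)."""
    return (item.answer + 1) % len(item.choices)


def accuracy(model: Callable[[str], str], items: Sequence[MCQ],
             deceptive: bool) -> float:
    """Fraction of items answered correctly, with or without a wrong hint."""
    correct = 0
    for item in items:
        hint = wrong_hint(item) if deceptive else None
        prediction = model(format_prompt(item, hint))
        if prediction.strip().upper().startswith(LETTERS[item.answer]):
            correct += 1
    return correct / len(items)


def relative_drop(model: Callable[[str], str], items: Sequence[MCQ]) -> float:
    """Relative accuracy loss caused by the misleading hint (0 = fully resilient)."""
    base = accuracy(model, items, deceptive=False)
    misled = accuracy(model, items, deceptive=True)
    return (base - misled) / base if base else 0.0
```

Under this sketch, a more resilient model would show a smaller `relative_drop`. A complementary check with correct hints (passing `item.answer` as the hint) would probe whether a model simply ignores in-context information, which the abstract rules out for larger models.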
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper provides valuable insights into the scaling effects on model resilience to misleading prompts, complementing existing LLM scaling studies.
2. The paper is well-organized and clearly written.
3. The main and additional experiments thoroughly validate the central hypothesis.
1. **Marginal Contribution**: Compared with previous works investigating LLM vulnerability to prompt manipulation, as explained in Section 2, it seems the most significant difference lies in the sole focus on parameter scaling effects, specifically 2B-70B models. It is expected that larger models, benefiting from refined training, demonstrate improved knowledge retention, instruction following, and in-context learning, following the emergent abilities analysis [1]. While the scaling investigati
1) The paper compares susceptibility to deception across smaller and larger language models.
2) The idea of providing misleading information to LLMs and testing their behaviour is interesting.
3) The paper presents a wide range of ideas and possibilities regarding deception, and the accompanying analysis is quite interesting.
4) The datasets cover a wide range of domains, such as maths, science, and other questions as well.
1) The authors use only multiple-choice questions for the deception analysis; they could have used question-answering datasets where the input is text and the output is also free text rather than multiple choice.
2) The experimentation and prompting do not look suitable. Supplying an incorrect answer as a hint and then asking the model to select the correct option is not the right way to prompt. Instead, you could prompt the process, like "refrigerator does not conduct electricity or does not contain
- The study addresses a pertinent issue in the deployment of LLMs: robustness against misleading or deceptive information. With the increasing use of LLMs across various applications, understanding their behavior in such scenarios is essential for ensuring reliability and safety, highlighting the relevance and timeliness of the study.
- The authors present a clear and structured evaluation framework. By leveraging controlled experiments and multiple-choice benchmarks, they systematically analyze
- Insufficient Exploration of Underlying Mechanisms: The assertion that larger models possess more robust "world models" is a plausible hypothesis but remains inadequately supported by the experiments presented. While the control tests (e.g., context removal, directive instructions) attempt to eliminate explanations like memorization, they do not sufficiently clarify how models integrate conflicting information. To substantiate claims about the mechanisms underlying resilience, more detailed ana
Taxonomy
Topics: Topic Modeling · Ethics and Social Impacts of AI · Artificial Intelligence in Law
