Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks ?
Virgile Rennard, Christos Xypolopoulos, Michalis Vazirgiannis

TL;DR
This paper investigates how robust the biases of large language models are by having two instances of the same model debate opposing views before a neutral third instance, assessing bias persistence and susceptibility to misinformation across models and languages.
Contribution
It introduces a novel self-debate framework to evaluate bias robustness and explores how biases are reinforced or challenged during interactions.
Findings
Biases can be reinforced or challenged during model debates.
Bias robustness varies across model sizes and languages.
Models show susceptibility to misinformation reinforcement.
Abstract
Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.
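The self-debate setup described in the abstract can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the `Model` callable, the `self_debate` function, the prompt wording, the number of rounds, and the `stub_model` stand-in are all hypothetical. The idea is only to show the control flow: ask a neutral instance for its stance, let two instances argue opposite sides, then ask the neutral instance again.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to a text reply.
Model = Callable[[str], str]


def self_debate(
    model: Model, statement: str, rounds: int = 2
) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Return (stance_before, stance_after, transcript) for one statement."""
    ask = f"Do you agree or disagree with: {statement!r}? Answer in one word."

    # Neutral instance, queried before it has seen any debate.
    stance_before = model(ask)

    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        # Two instances of the same model take opposite sides in turn.
        for side in ("FOR", "AGAINST"):
            history = "\n".join(f"{s}: {t}" for s, t in transcript)
            prompt = (
                f"You argue {side} the statement {statement!r}.\n"
                f"Debate so far:\n{history}\n"
                "Your next argument:"
            )
            transcript.append((side, model(prompt)))

    # Neutral instance again, now conditioned on the full debate.
    history = "\n".join(f"{s}: {t}" for s, t in transcript)
    stance_after = model(f"Debate:\n{history}\n{ask}")
    return stance_before, stance_after, transcript


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM call (assumption, for illustration)."""
    if prompt.startswith("You argue FOR"):
        return "argument for"
    if prompt.startswith("You argue AGAINST"):
        return "argument against"
    if prompt.startswith("Debate:"):
        return "disagree"
    return "agree"


before, after, transcript = self_debate(stub_model, "X", rounds=2)
shifted = before != after  # did the neutral stance move after the debate?
```

In practice `stub_model` would be replaced by a call to a real model, and comparing `stance_before` with `stance_after` over many statements would give a rough measure of how firmly an initial stance holds.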
Taxonomy
Topics: Legal Education and Practice Innovations
