Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks ?

Virgile Rennard; Christos Xypolopoulos; Michalis Vazirgiannis

arXiv:2410.13517·cs.CL·November 6, 2024

TL;DR

This paper investigates how robust the biases of large language models are by having two instances of a model debate opposing views in order to persuade a neutral instance of the same model. It assesses bias persistence and susceptibility to misinformation across different models and languages.

Contribution

It introduces a novel self-debate framework to evaluate bias robustness and explores how biases are reinforced or challenged during interactions.

Findings

01. Biases can be reinforced or challenged during model debates.

02. Bias robustness varies across model sizes and languages.

03. Models are susceptible to reinforcing misinformation.

Abstract

Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.
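The self-debate setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the number of rounds, the one-word stance query, and the names `self_debate`, `DebateResult`, and `toy_model` are all assumptions introduced here. The `generate` argument stands in for any function that sends a prompt to an LLM and returns its reply.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a reply string.
Generate = Callable[[str], str]


@dataclass
class DebateResult:
    transcript: List[Tuple[str, str]] = field(default_factory=list)
    stance_before: str = ""
    stance_after: str = ""


def self_debate(generate: Generate, statement: str, rounds: int = 2) -> DebateResult:
    """Two debater instances argue opposite sides; a neutral instance is
    asked for its stance before and after reading the debate."""
    result = DebateResult()
    ask = f"Statement: {statement}\nDo you agree or disagree? Answer in one word."

    # Neutral instance: stance before seeing any arguments.
    result.stance_before = generate(ask)

    for _ in range(rounds):
        for side in ("for", "against"):
            history = "\n".join(f"[{s}] {t}" for s, t in result.transcript)
            prompt = (
                f"You argue {side} the statement: {statement}\n"
                f"Debate so far:\n{history}\nGive your next argument."
            )
            result.transcript.append((side, generate(prompt)))

    # Neutral instance: stance after reading the full transcript.
    history = "\n".join(f"[{s}] {t}" for s, t in result.transcript)
    result.stance_after = generate(f"{ask}\nDebate you just read:\n{history}")
    return result


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM, used only to exercise the loop."""
    if "Answer in one word" in prompt:
        return "disagree" if "Debate you just read" in prompt else "agree"
    return "argument for" if "You argue for" in prompt else "argument against"


if __name__ == "__main__":
    outcome = self_debate(toy_model, "Example statement", rounds=2)
    print(outcome.stance_before, "->", outcome.stance_after)
```

Comparing `stance_before` with `stance_after` over many statements gives one simple way to see whether a model's initial position holds up under its own counterarguments. A real study would replace `toy_model` with actual model calls and would likely use a finer-grained stance measure than a single word.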

Taxonomy

Topics: Legal Education and Practice Innovations