Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral

Fengyuan Liu; Nouar AlDahoul; Gregory Eady; Yasir Zaki; Talal Rahwan

arXiv:2406.10400·cs.CL·February 18, 2025·1 cites

TL;DR

This paper demonstrates that self-reflection in large language models can significantly improve safety, reduce bias, and neutralize ideological leaning, although its effectiveness on reasoning depends on prompt wording and complexity.

Contribution

The study systematically evaluates self-reflection in LLMs, showing its potential for improving safety and mitigating bias, and clarifying that its effect on reasoning performance is sensitive to how the prompt is worded.

Findings

01. Self-reflection reduces toxic responses by 75.8%.

02. Self-reflection decreases gender bias by 77%.

03. Self-reflection eliminates partisan bias completely.

Abstract

Previous studies proposed that the reasoning capabilities of large language models (LLMs) can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in the initial responses. However, earlier experiments offer mixed results when it comes to the benefits of self-reflection. Furthermore, prior studies on self-reflection are predominantly concerned with the reasoning capabilities of models, ignoring the potential for self-reflection in safety, bias, and ideological leaning. Here, by conducting a series of experiments testing LLMs' self-reflection capability in various tasks using a variety of prompts and different LLMs, we make several contributions to the literature. First, we reconcile conflicting findings regarding the benefit of self-reflection, by demonstrating that the outcome of self-reflection is sensitive to prompt…
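
The self-reflection procedure the abstract describes (query the model, then ask it to review and correct its own initial response) can be illustrated with a minimal two-pass sketch. Everything named here is an assumption for illustration: `self_reflect`, `REFLECTION_TEMPLATE`, and `toy_model` are hypothetical, the prompt wording is not taken from the paper, and the model is a stub standing in for a real LLM call.

```python
from typing import Callable

# Hypothetical reflection prompt; the paper's actual prompt wording is not shown here.
REFLECTION_TEMPLATE = (
    "Question: {question}\n"
    "Your initial answer: {answer}\n"
    "Reflect on this answer. If it contains a mistake, give a corrected answer; "
    "otherwise repeat the answer unchanged."
)


def self_reflect(model: Callable[[str], str], question: str) -> tuple[str, str]:
    """Run a two-pass query and return (initial_answer, revised_answer)."""
    # Pass 1: get the model's initial response to the question.
    initial = model(question)
    # Pass 2: show the model its own response and ask it to reflect on it.
    prompt = REFLECTION_TEMPLATE.format(question=question, answer=initial)
    revised = model(prompt)
    return initial, revised


def toy_model(prompt: str) -> str:
    """Stub standing in for an LLM: wrong on the first pass, corrected on reflection."""
    if "Reflect on this answer" in prompt:
        return "4"
    return "5"


if __name__ == "__main__":
    print(self_reflect(toy_model, "What is 2 + 2?"))
```

Because the reflection prompt is an ordinary string template, swapping in a different wording is a one-line change, which is the kind of variation the abstract says the outcome is sensitive to.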

Code & Models

Repositories

michael98liu/mixture-of-prompts (PyTorch, official)

Taxonomy

Topics: Educational and Psychological Assessments