Diagnosing and Debiasing Corpus-Based Political Bias and Insults in GPT2

Ambri Ma; Arnav Kumar; Brett Zeligson

arXiv:2311.10266 · cs.CL · November 20, 2023 · 1 citation

TL;DR

This paper examines whether self-diagnosis and self-debiasing, techniques previously applied to toxic language, can also detect and reduce political bias and insults in GPT-2 output.

Contribution

It extends existing self-diagnosis and self-debiasing methods to political bias and insults in GPT-2, two bias types that prior work on these methods did not address.

Findings

01. Self-diagnosis can identify biases in generated content.

02. Self-debiasing reduces the likelihood of harmful outputs.

03. Effective mitigation of political bias and insults in GPT-2.

Abstract

The training of large language models (LLMs) on extensive, unfiltered corpora sourced from the internet is a common and advantageous practice. Consequently, LLMs have learned and inadvertently reproduced various types of biases, including violent, offensive, and toxic language. However, recent research shows that generative pretrained transformer (GPT) language models can recognize their own biases and detect toxicity in generated content, a process referred to as self-diagnosis. In response, researchers have developed a decoding algorithm that allows LLMs to self-debias, or reduce their likelihood of generating harmful text. This study investigates the efficacy of the diagnosing-debiasing approach in mitigating two additional types of biases: insults and political bias. These biases are often used interchangeably in discourse, despite exhibiting potentially dissimilar semantic and…
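The two steps named in the abstract can be illustrated with a small sketch. It is not taken from this paper: it assumes the formulation commonly used in the self-debiasing literature, where self-diagnosis normalizes the model's "Yes"/"No" probabilities for a question about an attribute, and self-debiasing down-weights tokens that become more likely when the input is prefixed with a description of the undesired attribute. The function names, the toy token probabilities, and the `decay` value are hypothetical and chosen only for illustration.

```python
import math


def self_diagnosis_score(p_yes: float, p_no: float) -> float:
    """Normalized probability that the model answers "Yes" when asked
    whether a text contains a given attribute (assumed formulation)."""
    return p_yes / (p_yes + p_no)


def self_debias(p_plain: dict, p_prompted: dict, decay: float = 50.0) -> dict:
    """Rescale next-token probabilities (assumed formulation).

    p_plain:    next-token probabilities for the original input.
    p_prompted: next-token probabilities when the input is prefixed with
                a description of the undesired attribute (e.g. insults).
    decay:      hypothetical strength constant; larger values penalize
                attribute-associated tokens more strongly.
    """
    scaled = {}
    for token, p in p_plain.items():
        # Negative delta: the token is more likely under the biased prompt.
        delta = p - p_prompted.get(token, 0.0)
        alpha = 1.0 if delta >= 0 else math.exp(decay * delta)
        scaled[token] = alpha * p
    total = sum(scaled.values())
    # Renormalize so the result is a probability distribution again.
    return {token: value / total for token, value in scaled.items()}


# Toy distributions, invented for illustration only.
p_plain = {"idiot": 0.2, "person": 0.5, "friend": 0.3}
p_prompted = {"idiot": 0.6, "person": 0.25, "friend": 0.15}

debiased = self_debias(p_plain, p_prompted)
print(debiased)
print(self_diagnosis_score(p_yes=0.3, p_no=0.1))
```

In this toy case the token that the attribute-describing prompt makes more likely is suppressed almost entirely, while the remaining tokens keep their relative proportions. With a real model, the two distributions would come from two forward passes of GPT-2 over the plain and prefixed inputs.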

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Text Readability and Simplification