Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do

Patrick Schramowski; Cigdem Turan; Nico Andersen; Constantin A. Rothkopf; Kristian Kersting

arXiv:2103.11790·cs.CL·February 15, 2022

TL;DR

This paper reveals that large pre-trained language models encode human-like moral biases, which can be geometrically captured and used to guide models towards producing more normative, less toxic text.

Contribution

It introduces the concept of a 'moral direction' in embedding space, enabling the detection and mitigation of biased and toxic behaviors in language models.

Findings

01. Moral norms are encoded as a geometric direction in embedding space.

02. The moral direction correlates with societal norms expressed in training data.

03. Using the moral direction reduces toxic degeneration in GPT-2.

Abstract

Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many NLP tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show that recent LMs also contain human-like biases of what is right and wrong to do, some form of ethical and moral norms of the society -- they bring a "moral direction" to the surface. That is, we show that these norms can be captured geometrically by a direction, which can be computed, e.g., by a PCA, in the embedding space,…
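
The abstract above describes capturing norms as a direction computed by PCA in embedding space. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes hypothetical 3-dimensional toy vectors in place of real sentence embeddings, and the phrase list, the names `TOY_EMBEDDINGS`, `top_principal_direction`, and `score`, and the use of power iteration to find the first principal component are all illustrative choices.

```python
import math
import random

# Hypothetical 3-d "embeddings" for illustration only. A real setup would
# take sentence embeddings from a pre-trained language model instead.
TOY_EMBEDDINGS = {
    "help people":   [0.9, 0.1, 0.0],
    "smile":         [0.8, -0.1, 0.1],
    "thank friends": [1.0, 0.0, -0.1],
    "harm people":   [-0.9, 0.1, 0.1],
    "steal":         [-0.8, -0.1, 0.0],
    "lie":           [-1.0, 0.0, -0.1],
}
POSITIVE = ["help people", "smile", "thank friends"]
NEGATIVE = ["harm people", "steal", "lie"]


def top_principal_direction(vectors, iterations=100, seed=0):
    """Return (mean, unit vector) for the first principal component."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        # Power iteration: apply the covariance matrix as X^T (X w) / n.
        proj = [sum(c[i] * w[i] for i in range(dim)) for c in centered]
        w = [sum(p * c[i] for p, c in zip(proj, centered)) / n
             for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return mean, w


def score(vector, mean, direction):
    """Project a centered vector onto the direction."""
    return sum((x - m) * d for x, m, d in zip(vector, mean, direction))


mean, direction = top_principal_direction(list(TOY_EMBEDDINGS.values()))

# The sign of a principal component is arbitrary, so orient it with one
# anchor phrase that is assumed to be positive.
if score(TOY_EMBEDDINGS["help people"], mean, direction) < 0:
    direction = [-d for d in direction]

scores = {p: score(v, mean, direction) for p, v in TOY_EMBEDDINGS.items()}
for phrase, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{phrase:15s} {s:+.2f}")
```

In this toy data the two groups are separated along one axis by construction, so the first principal component recovers that axis. With real embeddings, how much variance the first component explains would have to be checked rather than assumed.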

Code & Models

Repositories

ml-research/MoRT_NMI
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education

Methods: Linear Layer · Principal Components Analysis · Byte Pair Encoding · Discriminative Fine-Tuning · Cosine Annealing · Linear Warmup With Cosine Annealing · GPT-2 · Residual Connection · Layer Normalization · WordPiece