Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do
Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin A. Rothkopf, Kristian Kersting

TL;DR
This paper reveals that large pre-trained language models encode human-like moral biases, which can be geometrically captured and used to guide models towards producing more normative, less toxic text.
Contribution
It introduces the concept of a 'moral direction' in embedding space, enabling the detection and mitigation of biased and toxic behaviors in language models.
Findings
Moral norms are encoded as a geometric direction in embedding space.
The moral direction correlates with societal norms expressed in training data.
Using the moral direction reduces toxic degeneration in GPT-2.
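The geometric idea behind these findings can be illustrated with a small sketch. It is not the authors' implementation: the 8-dimensional synthetic vectors stand in for real sentence embeddings, and the names `fit_direction`, `score`, and `positive_mask` are hypothetical. It only shows how a first principal component can act as a scoring direction, assuming the embeddings of "do" and "don't" phrases separate along one dominant axis.

```python
import numpy as np


def fit_direction(embeddings, positive_mask):
    """Fit a scoring direction as the first principal component.

    embeddings: (n, d) array of phrase embeddings (synthetic here).
    positive_mask: boolean array marking phrases assumed to be normative.
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal axes, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # The sign of a principal component is arbitrary, so orient it
    # such that the positive anchors project to positive scores.
    if (centered[positive_mask] @ direction).mean() < 0:
        direction = -direction
    return mean, direction


def score(embedding, mean, direction):
    """Project a single embedding onto the fitted direction."""
    return float((embedding - mean) @ direction)


# Synthetic stand-ins for embeddings (assumption: not real model output).
rng = np.random.default_rng(0)
axis = np.zeros(8)
axis[0] = 1.0
positive = 2.0 * axis + 0.1 * rng.standard_normal((5, 8))
negative = -2.0 * axis + 0.1 * rng.standard_normal((5, 8))

embeddings = np.vstack([positive, negative])
positive_mask = np.array([True] * 5 + [False] * 5)

mean, direction = fit_direction(embeddings, positive_mask)
scores = [score(e, mean, direction) for e in embeddings]
```

In this toy setup the first five scores come out positive and the last five negative. With real embeddings the separation would depend on the model and on the phrases chosen as anchors.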
Abstract
Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended state of the art for many NLP tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show that recent LMs also contain human-like biases of what is right and wrong to do, some form of ethical and moral norms of the society -- they bring a "moral direction" to surface. That is, we show that these norms can be captured geometrically by a direction, which can be computed, e.g., by a PCA, in the embedding space,…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education
Methods: Linear Layer · Principal Components Analysis · Byte Pair Encoding · Discriminative Fine-Tuning · Cosine Annealing · Linear Warmup With Cosine Annealing · GPT-2 · Residual Connection · Layer Normalization · WordPiece
