Towards detecting unanticipated bias in Large Language Models

Anna Kruspe

arXiv:2404.02650 · cs.LG · April 4, 2024 · 3 cites

Open Access

TL;DR

This paper investigates new methods to detect hidden, unanticipated biases in Large Language Models by leveraging Uncertainty Quantification and Explainable AI techniques to improve fairness and transparency.

Contribution

It introduces novel approaches focusing on Uncertainty Quantification and Explainable AI to identify subtle biases in LLMs that are difficult to detect with existing methods.

Findings

01. Uncertainty measures can reveal biased model behaviors.

02. Explainability techniques help uncover hidden biases.

03. Proposed methods improve bias detection in LLMs.

Abstract

Over the last year, Large Language Models (LLMs) like ChatGPT have become widely available and have exhibited fairness issues similar to those in previous machine learning systems. Current research is primarily focused on analyzing and quantifying these biases in training data and their impact on the decisions of these models, alongside developing mitigation strategies. This research largely targets well-known biases related to gender, race, ethnicity, and language. However, it is clear that LLMs are also affected by other, less obvious implicit biases. The complex and often opaque nature of these models makes detecting such biases challenging, yet this is crucial due to their potential negative impact in various applications. In this paper, we explore new avenues for detecting these unanticipated biases in LLMs, focusing specifically on Uncertainty Quantification and Explainable AI…
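As a rough illustration of how Uncertainty Quantification might be used as a bias signal, the sketch below compares the predictive entropy of a model's output distribution for two prompts that differ only in one attribute. This is not the paper's method: the function names `predictive_entropy` and `entropy_gap` are hypothetical, and the probability values are made-up placeholders standing in for scores a real model would return.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution.

    The input is renormalized first, so unnormalized scores are accepted.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [p / total for p in probs]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in normalized if p > 0)


def entropy_gap(probs_a, probs_b):
    """Entropy of variant A minus entropy of variant B."""
    return predictive_entropy(probs_a) - predictive_entropy(probs_b)


# Hypothetical answer distributions for two prompts that are identical
# except for one attribute (assumed values, not measured results).
probs_variant_a = [0.70, 0.20, 0.10]
probs_variant_b = [0.34, 0.33, 0.33]

gap = entropy_gap(probs_variant_a, probs_variant_b)
print(f"entropy gap (A - B): {gap:.3f} nats")
```

A consistently large gap across many such prompt pairs would only be a hint that the changed attribute affects the model's confidence; it would still need to be checked against a baseline of unrelated rewordings before being read as evidence of bias.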

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods