FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs

Zhiting Fan; Ruizhe Chen; Tianxiang Hu; Zuozhu Liu

arXiv:2410.19317 · cs.CL · June 11, 2025

TL;DR

This paper introduces FairMT-Bench, a comprehensive benchmark for assessing LLM fairness in multi-turn dialogue. Evaluations on it show that current models tend to produce more biased responses in multi-turn settings and that fairness varies substantially across models and tasks.

Contribution

It presents a new multi-turn fairness benchmark, dataset, and evaluation framework specifically designed for realistic conversational AI scenarios.

Findings

01

Current LLMs tend to produce more biased responses in multi-turn dialogues.

02

Significant variation exists in fairness performance across different models and tasks.

03

The benchmark and dataset facilitate more realistic assessment of LLM fairness.

Abstract

The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present greater challenges due to conversational complexity and potential bias accumulation. In this paper, we propose a comprehensive fairness benchmark for LLMs in multi-turn dialogue scenarios, FairMT-Bench. Specifically, we formulate a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. To ensure coverage of diverse bias types and attributes, we draw from existing…
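The three-stage, two-tasks-per-stage taxonomy described in the abstract can be pictured as a small lookup table plus a per-stage aggregation. The sketch below is illustrative only and is not the authors' code: the stage names come from the abstract, but the task identifiers (`task_a` … `task_f`), the `JudgedDialogue` record, and the `bias_rate_by_stage` helper are hypothetical names introduced here, and it assumes each dialogue receives a single biased/unbiased verdict from some judge.

```python
from collections import defaultdict
from dataclasses import dataclass

# Stage names follow the abstract. The task identifiers are placeholders
# (assumption): the excerpt above does not name the six tasks.
STAGES = {
    "context_understanding": ("task_a", "task_b"),
    "user_interaction": ("task_c", "task_d"),
    "instruction_trade_offs": ("task_e", "task_f"),
}


@dataclass(frozen=True)
class JudgedDialogue:
    """One multi-turn dialogue with a hypothetical binary judge verdict."""
    task: str
    biased: bool


def bias_rate_by_stage(results):
    """Return the fraction of dialogues judged biased, grouped by stage."""
    task_to_stage = {t: s for s, tasks in STAGES.items() for t in tasks}
    counts = defaultdict(lambda: [0, 0])  # stage -> [biased, total]
    for r in results:
        stage = task_to_stage[r.task]
        counts[stage][0] += int(r.biased)
        counts[stage][1] += 1
    return {stage: biased / total for stage, (biased, total) in counts.items()}
```

For example, passing two judged dialogues from `task_a` and `task_b`, one of them flagged as biased, yields a rate of 0.5 for `context_understanding`; stages with no dialogues are simply absent from the result.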

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper contributes a novel fairness benchmark specifically for multi-turn dialogues, while current benchmarks primarily focus on single-turn dialogues.
2. The paper extensively benchmarks on most popular LLMs, and provides detailed results and analysis across many dimensions, like tasks, dialogue turns, bias types and attributes. The paper comprehensively demonstrates each LLM's performance, and pinpoints the areas where fairness is challenging to LLMs. The results show that fairness, esp

Weaknesses

1. A few factors can make the evaluation computationally expensive: (1) using GPT-4 as the evaluator, (2) the multi-turn nature of the data and the evaluation process, and (3) the data size. It would be great if the paper can include some discussion on evaluation cost.
2. While the paper discusses diverse sources and dimensions of bias, it does not discuss potential mitigation strategies. Offering even preliminary solutions or suggestions for future research directions would be valuable.
3. As fairn

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Novel Focus on Multi-Turn Fairness Evaluation: The paper addresses the crucial gap of multi-turn dialogue fairness, reflecting real-world complexities in conversational AI use cases.
2. FairMT-Bench and its datasets (FairMT-10K and FairMT-1K) cover a wide array of bias types and attributes, providing a rich resource for fairness research.
3. By evaluating 15 prominent LLMs, the paper provides a robust, comparative analysis of model fairness, offering valuable insights for future LLM alignment

Weaknesses

1. While the multi-turn focus is novel, the evaluation method largely depends on established LLM tools (e.g., GPT-4 as a judge), which may limit innovation in developing new fairness detection methodologies.
2. The paper does not thoroughly explore why certain attributes (like gender and race) showed consistently poor performance across models, missing an opportunity to deepen the community’s understanding of these biases.
3. Relying heavily on GPT-4 for generating synthetic dialogue data could

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. **Valuable Resources** This paper is the first to present a fairness benchmark in multi-turn dialogue scenarios, covering diverse bias types and attributes.
2. **Extensive Experiments** Conducts comprehensive experiments on current SOTA LLMs across six designed tasks.
3. **Reliable Evaluation** Uses GPT-4 as a judge, alongside bias classifiers including Llama-Guard-3 and human validation.
4. **Comprehensive Analysis** Analyzes evaluation results of single-turn and multi-turn dialogue across different mod

Weaknesses

**Ambiguous Task Taxonomy** In Section 3.2, two taxonomies of fairness tasks are primarily discussed: comprehension-focused tasks vs. bias-resistance tasks (Lines 318-320) and implicit biases vs. explicit biases (Lines 322-323). These taxonomies are clear and reasonable. However, the taxonomy outlined in Section 2.1 lacks clarity and mention. The naming of the "interaction fairness" class is somewhat confusing, and the boundaries between this class and the other two are not clearly defined.

Videos

FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs · slideslive

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Dispute Resolution and Class Actions

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Multi-Head Attention · Softmax · Adam