Large Language Models Are Not Robust Multiple Choice Selectors

Chujie Zheng; Hao Zhou; Fandong Meng; Jie Zhou; Minlie Huang

arXiv:2309.03882 · cs.CL · February 23, 2024 · 22 cites

TL;DR

This paper reveals that large language models exhibit a bias towards certain option IDs in multiple choice questions, affecting their robustness, and introduces a simple inference-time method called PriDe to mitigate this bias.

Contribution

The paper identifies the token bias causing selection bias in LLMs and proposes PriDe, a label-free, efficient debiasing method that improves robustness in multiple choice tasks.

Findings

01 LLMs prefer specific option IDs due to token bias

02 PriDe effectively reduces selection bias

03 Debiasing improves LLM robustness in MCQs

Abstract

Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs). This work shows that modern LLMs are vulnerable to option position changes in MCQs due to their inherent "selection bias", namely, they prefer to select specific option IDs as answers (like "Option A"). Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. To mitigate selection bias, we propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution. PriDe first estimates the prior by permutating option contents on a small…
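The debiasing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the observed probability of an option ID factorizes as a content-independent prior over IDs times a content-dependent term, and that the prior is estimated from cyclic shifts of the option contents; the abstract shown here is truncated, so both choices are assumptions. All function names and the synthetic numbers are hypothetical, and `simulate_obs` stands in for a real LLM call.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def estimate_sample_prior(obs_by_shift):
    """Estimate the option-ID prior from one question.

    obs_by_shift[s][i] is the observed probability of option ID i when the
    option contents are cyclically shifted by s (assumed estimation scheme).
    Across all shifts every content visits every ID once, so averaging the
    log-probabilities cancels the content term up to a constant.
    """
    log_obs = np.log(np.asarray(obs_by_shift, dtype=float))
    return softmax(log_obs.mean(axis=0))


def estimate_global_prior(per_sample_obs):
    """Average the per-question priors over a small estimation set."""
    priors = [estimate_sample_prior(obs) for obs in per_sample_obs]
    return np.mean(priors, axis=0)


def debias(obs, prior):
    """Divide out the ID prior from an observed distribution and renormalize."""
    scores = np.asarray(obs, dtype=float) / np.asarray(prior, dtype=float)
    return scores / scores.sum()


def simulate_obs(prior, content, shift):
    """Hypothetical stand-in for an LLM: prior over IDs times shifted content."""
    joint = np.asarray(prior) * np.roll(content, -shift)
    return joint / joint.sum()


# Synthetic model that favors ID "A" regardless of content (made-up numbers).
true_prior = np.array([0.4, 0.3, 0.2, 0.1])
estimation_contents = [
    np.array([0.1, 0.6, 0.2, 0.1]),
    np.array([0.7, 0.1, 0.1, 0.1]),
]
per_sample_obs = [
    [simulate_obs(true_prior, content, shift) for shift in range(4)]
    for content in estimation_contents
]
prior_hat = estimate_global_prior(per_sample_obs)

# A new question, answered once in its default option order.
new_content = np.array([0.2, 0.2, 0.25, 0.35])
new_obs = simulate_obs(true_prior, new_content, 0)
debiased = debias(new_obs, prior_hat)
```

In this toy setup the biased prediction picks the first option, while the debiased prediction picks the fourth, which is the option the content term actually favors. Once the prior is estimated, each remaining question needs only a single forward pass, which is where the efficiency claim would come from under these assumptions.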

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 8 · accept, good paper · Confidence 4

Strengths

1. It flows! The writing is perfect. All sections follow each other naturally, from problem to observation, to diagnosis, to ruling out simplistic solutions, to proposed solutions. In each step, there are corresponding experiments to substantiate it. 2. There are some clever experiment designs in diagnosing the cause, and the experiments are carried out with caution (e.g. replacing symbols to confirm). 3. Comprehensive experiments on many models and datasets.

Weaknesses

1. When the compute budget is unbounded, the proposed method sometimes has a slight accuracy disadvantage compared to full permutation.

Reviewer 02 · Rating 8 · accept, good paper · Confidence 2

Strengths

I really appreciate that the paper conducted extensive experiments to demonstrate and analyze the Option-Order Sensitivity problem. Some observations are really interesting; for example, even models of different parameter sizes trained on the same data exhibit different position preferences. PriDe is intuitive but also effective.

Weaknesses

It would be better to cite "Leveraging large language models for multiple choice question answering" or other related papers when mentioning the Option-Order Sensitivity problem, since they identified the problem before this paper. It would also be better to analyze more techniques, including self-consistency.

Reviewer 03 · Rating 8 · accept, good paper · Confidence 3

Strengths

1. The empirical analysis is thorough, involving 20 LLMs and three benchmark datasets. This extensive evaluation provides strong evidence for the existence of selection bias in LLMs and its impact on their performance in MCQ tasks. The identification of token bias as the primary source of this issue is a valuable insight that can inform future research on LLMs and their limitations. 2. The proposed PriDe method is effective when the computing cost is limited. Further analysis on generalizabilit

Weaknesses

1. It seems that PriDe achieves performance comparable to simple baselines when the computation cost is not limited. In application scenarios, we always first estimate the prior without regard to the computation cost, then apply this prior to serve applications. It would be better if PriDe could have a higher upper-bound performance.

Code & Models

Repositories

chujiezheng/llm-mcq-bias (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems