Large Language Models Are Not Robust Multiple Choice Selectors
Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

TL;DR
This paper reveals that large language models exhibit a bias towards certain option IDs in multiple choice questions, affecting their robustness, and introduces a simple inference-time method called PriDe to mitigate this bias.
Contribution
The paper identifies the token bias causing selection bias in LLMs and proposes PriDe, a label-free, efficient debiasing method that improves robustness in multiple choice tasks.
Findings
LLMs prefer specific option IDs due to token bias
PriDe effectively reduces selection bias
Debiasing improves LLM robustness in MCQs
Abstract
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs). This work shows that modern LLMs are vulnerable to option position changes in MCQs due to their inherent "selection bias", namely, they prefer to select specific option IDs as answers (like "Option A"). Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. To mitigate selection bias, we propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution. PriDe first estimates the prior by permutating option contents on a small…
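The prior-separation idea described in the abstract can be illustrated with a small sketch. The abstract is cut off before the details, so everything below is an assumption made for illustration, not the authors' implementation: the observed distribution over option IDs is modelled as (ID prior) x (content preference), the prior is estimated by averaging log-probabilities over cyclic permutations of the option contents, and debiasing divides the observed probabilities by that prior. The helper names (`make_toy_model`, `estimate_prior`, `debias`) are hypothetical, and the toy model stands in for a real LLM call.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cyclic_permutations(n):
    """Return n permutations; perm[i] is the content index shown in ID slot i."""
    return [[(i + shift) % n for i in range(n)] for shift in range(n)]


def make_toy_model(id_prior, content_pref):
    """Hypothetical stand-in for an LLM: P(slot i) is proportional to
    id_prior[i] * content_pref[perm[i]]."""
    def observe(perm):
        scores = [id_prior[i] * content_pref[perm[i]] for i in range(len(perm))]
        total = sum(scores)
        return [s / total for s in scores]
    return observe


def estimate_prior(observe_fns, n_options):
    """Estimate a global ID prior from a few samples (assumed procedure).

    For each sample, average the log-probability of every ID slot over the
    cyclic permutations and renormalise; then average across samples.
    """
    perms = cyclic_permutations(n_options)
    global_prior = [0.0] * n_options
    for observe in observe_fns:
        mean_log = [0.0] * n_options
        for perm in perms:
            probs = observe(perm)
            for i, p in enumerate(probs):
                mean_log[i] += math.log(p) / len(perms)
        sample_prior = softmax(mean_log)
        for i, p in enumerate(sample_prior):
            global_prior[i] += p / len(observe_fns)
    return global_prior


def debias(observed, prior):
    """Divide out the estimated ID prior and renormalise."""
    scores = [p / q for p, q in zip(observed, prior)]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Toy setup: the model favours slot A, but the content in slot B is best.
    observe = make_toy_model([0.55, 0.15, 0.15, 0.15], [0.2, 0.5, 0.15, 0.15])
    prior = estimate_prior([observe], 4)
    raw = observe([0, 1, 2, 3])
    print("raw:     ", [round(p, 3) for p in raw])
    print("debiased:", [round(p, 3) for p in debias(raw, prior)])
```

In this toy setup the raw prediction picks slot A because of the ID prior, while the debiased prediction picks slot B, whose content is actually preferred. Cyclic permutations are used here because each content then appears in each slot exactly once, so the content term cancels when log-probabilities are averaged; a real model will only satisfy the factorisation approximately.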
Peer Reviews
Decision: ICLR 2024 spotlight
1. It flows! The writing is perfect. All sections follow each other naturally, from problem to observation, to diagnosis, to ruling out simplistic solutions, to proposed solutions. In each step, there are corresponding experiments to substantiate it. 2. There are some clever experiment designs in diagnosing the cause, and the experiments are carried out with caution (e.g. replacing symbols to confirm). 3. Comprehensive experiments on many models and datasets.
1. When the compute budget is unbounded, the proposed method sometimes has a slight accuracy disadvantage compared to full perm.
I really appreciate that the paper conducted extensive experiments to demonstrate and analyze the Option-Order Sensitivity problem. Some observations are really interesting; for example, models of different parameter sizes trained on the same data still exhibit different position preferences. PriDe is intuitive but also effective.
It would be better to cite "Leveraging large language models for multiple choice question answering" or other related papers when mentioning the Option-Order Sensitivity problem, since they identified the problem before this paper. It would also be better to analyze more techniques, including self-consistency.
1. The empirical analysis is thorough, involving 20 LLMs and three benchmark datasets. This extensive evaluation provides strong evidence for the existence of selection bias in LLMs and its impact on their performance in MCQ tasks. The identification of token bias as the primary source of this issue is a valuable insight that can inform future research on LLMs and their limitations. 2. The proposed PriDe method is effective when the computing cost is limited. Further analysis on generalizabilit
1. It seems that PriDe achieves performance comparable to simple baselines when the computation cost is not limited. In application scenarios, we always first estimate the prior without regard to computation cost, then apply this prior to serve applications. It would be better if PriDe had a higher upper-bound performance.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems
