SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

Weijie Xu; Shixian Cui; Xi Fang; Chi Xue; Stephanie Eckman; Chandan K. Reddy

arXiv:2506.00643·cs.CL·October 21, 2025

TL;DR

SATA-BENCH is a new benchmark for evaluating large language models on Select All That Apply questions, revealing current limitations and proposing a decoding strategy to improve multi-answer accuracy.

Contribution

The paper introduces SATA-BENCH, the first dedicated benchmark for multi-answer questions, and proposes Choice Funnel, a decoding method that enhances LLMs' ability to identify all correct answers.

Findings

01. Even the strongest model achieves only 41.8% exact match on SATA-BENCH.

02. Choice Funnel improves exact match by up to 29%.

03. The approach reduces inference cost by over 64%.

Abstract

Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias - models favor certain choices regardless of content, and count bias - models fail to predict the correct number of answers. To address these issues, we propose Choice Funnel, a decoding…
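
The exact match figure and the count bias described in the abstract can be made concrete with a small sketch. It assumes that exact match means the predicted option set equals the gold set, and it measures count bias as the mean of predicted count minus gold count; the paper's own metric definitions may differ. The function names and toy data are hypothetical.

```python
from typing import List, Set


def exact_match(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    """Fraction of questions whose predicted option set equals the gold set."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, golds) if p == g)
    return hits / len(golds)


def mean_count_difference(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    """Mean of (options predicted - options correct); negative means under-selection."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    return sum(len(p) - len(g) for p, g in zip(preds, golds)) / len(golds)


# Toy data, invented for illustration only (not taken from SATA-BENCH).
golds = [{"A", "C"}, {"B"}, {"A", "B", "D"}, {"C", "D"}]
preds = [{"A", "C"}, {"B", "C"}, {"A"}, {"C", "D"}]

print(exact_match(preds, golds))            # 2 of 4 sets match exactly
print(mean_count_difference(preds, golds))  # (0 + 1 - 2 + 0) / 4
```

Under this definition a prediction that finds some but not all correct options scores zero, which is why exact match is a strict measure for multi-answer questions.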

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

1) The paper addresses a valid or a good gap. Authors point out that most real-world applications (e.g., content moderation, medical diagnosis, legal research) involve identifying multiple valid "answers". Current single-answer benchmarks fail to capture this capability. The creation of a large-scale, human-validated benchmark for this task could be a good resource.
2) The curation of SATA-BENCH is explained clearly, involving a multi-stage process of transformation and filtering based on readabi

Weaknesses

• SATA questions can have multiple answers, but it is debatable whether a ranking could exist among the correct answers. A binary prediction of whether a candidate option is correct or not doesn't capture this. This raises a slight concern about the positioning of the multiple-answers task within SATA.
• Further, one of the other important concerns regarding the curation of this dataset is that very little information is given on the annotators. What was their background? Did they have exper

Reviewer 02·Rating 6·Confidence 5

Strengths

* The paper is well-written and easy to understand. It is easy to see the motivation for the work, and the benchmark itself seems thoughtfully constructed.
* The authors have considered a variety of evaluative methods. It feels comprehensive.
* The authors test against a number of natural baselines.
* The benchmark has two qualities that I think are important for a benchmark: (1) there is room for improvement (but it's not too hard), and (2) it provides discriminative insight across models.

Weaknesses

While this is a well-done evaluation of a simple idea, I have two main concerns. The first is just the significance of the work. While I appreciate the careful attention to detail and the reasonably well done empirical work, it seems like a fairly niche question. Frankly, I don't think the authors do a good job of building the case for the need for this benchmark. The second is that this paper seems to be missing an important set of related work, where this problem is termed "multi-label class

Reviewer 03·Rating 6·Confidence 3

Strengths

The benchmark is novel in its approach to evaluating LLMs in SATA tasks. The analysis and classification of bias (speculation bias, unselection bias and count bias) in the execution of SATA tasks by LLM are clear and well-founded. The process of constructing the dataset is detailed, covering data filtering, readability scoring, and multiple rounds of manual validation. The experimental design covers a wide range, including 18 closed source models and 14 open source models, all of which prove t

Weaknesses

1. The Choice Funnel method relies on token probability and is therefore only applicable to open-source models with accessible probability distributions.
2. The method is similar to traditional greedy selection and early stopping mechanisms, with high similarity to existing selective prediction or multi label output strategies such as probability thresholding and top-k selection.
3. There may be significant differences in the threshold parameter τ between different models and tasks.
4. This pape

Reviewer 04·Rating 4·Confidence 4

Strengths

* The paper studies an important problem with real-world implications (e.g., medical diagnosis where multiple conditions could be valid options).
* The authors make a thorough, end-to-end contribution, spanning benchmark construction, evaluation and a deep-dive on failure modes, and proposing a candidate algorithm to mitigate some observed shortcomings.
* SATA-BENCH clearly had a lot of time put into producing a high-quality resource, with the authors taking numerous steps to remove too short or

Weaknesses

* Given the presentation of results in Table 2, it is hard to draw any meaningful conclusions about differences between models. In particular, there is no clear winner, and very little consistency in rank ordering of models between the different metrics. Perhaps the authors consider a clearer visualization of what this table is intended to convey.
* Relatedly, there are an awful lot of metrics deployed, and it's often unclear why multiple are required, given they appear to be different implement

Taxonomy

Topics·Multi-Criteria Decision Making