SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan K. Reddy

TL;DR
SATA-BENCH is a new benchmark for evaluating large language models on Select All That Apply questions, revealing current limitations and proposing a decoding strategy to improve multi-answer accuracy.
Contribution
The paper introduces SATA-BENCH, the first dedicated benchmark for multi-answer questions, and proposes Choice Funnel, a decoding method that enhances LLMs' ability to identify all correct answers.
Findings
Even the strongest of the 27 evaluated models achieves only 41.8% exact match on SATA-BENCH.
Choice Funnel improves exact match by up to 29%.
Choice Funnel also reduces inference cost by over 64%.
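For readers unfamiliar with the metric, exact match on a multi-answer question counts a prediction as correct only when the selected option set equals the gold set. The following is a minimal sketch under that usual definition; the function name `exact_match` and the toy option labels are illustrative assumptions, not taken from the paper's code.

```python
from typing import Iterable, Sequence


def exact_match(predictions: Sequence[Iterable[str]],
                golds: Sequence[Iterable[str]]) -> float:
    """Fraction of questions whose predicted option set equals the gold set."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    # Order and duplicates are ignored: only the set of chosen labels matters.
    hits = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical toy data (labels invented for illustration).
golds = [{"A", "C"}, {"B"}, {"A", "B", "D"}]
preds = [{"A", "C"}, {"B", "C"}, {"A", "B"}]
score = exact_match(preds, golds)  # only the first question matches exactly
```

Note that the second prediction over-selects and the third under-selects; both count as misses, which is why exact match is a strict metric for SATA questions.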
Abstract
Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias (models favor certain choices regardless of content) and count bias (models fail to predict the correct number of answers). To address these issues, we propose Choice Funnel, a decoding…
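The two biases named in the abstract can be made concrete with simple diagnostics. The sketch below is not the paper's metric definitions, which this summary does not give. It assumes two illustrative measures: the mean difference between predicted and gold answer counts (a rough proxy for count bias) and, per option label, how often a label is selected minus how often it is correct (a rough proxy for selection bias). All function names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Set


def mean_count_difference(preds: Sequence[Set[str]],
                          golds: Sequence[Set[str]]) -> float:
    """Average of (number selected - number correct); negative means under-selection."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equally long")
    return sum(len(p) - len(g) for p, g in zip(preds, golds)) / len(golds)


def label_selection_gap(preds: Sequence[Set[str]],
                        golds: Sequence[Set[str]]) -> Dict[str, int]:
    """Per label: times selected minus times correct across all questions."""
    picked = Counter(label for p in preds for label in p)
    correct = Counter(label for g in golds for label in g)
    labels = sorted(set(picked) | set(correct))
    # Counter returns 0 for labels it has never seen.
    return {label: picked[label] - correct[label] for label in labels}


# Hypothetical toy data (labels invented for illustration).
golds = [{"A", "C"}, {"B"}, {"A", "B", "D"}]
preds = [{"A"}, {"B"}, {"A", "B"}]
count_gap = mean_count_difference(preds, golds)
label_gap = label_selection_gap(preds, golds)
```

In this toy case the model selects too few answers on average and never picks the later options C and D, which is the kind of pattern the two biases describe.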
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1) The paper addresses a valid gap. The authors point out that most real-world applications (e.g., content moderation, medical diagnosis, legal research) involve identifying multiple valid "answers". Current single-answer benchmarks fail to capture this capability. The creation of a large-scale, human-validated benchmark for this task could be a good resource. 2) The curation of SATA-BENCH is explained clearly, involving a multi-stage process of transformation and filtering based on readabi
• SATA questions can have multiple answers, but it is debatable whether a ranking could exist among the correct answers. A binary prediction of whether a candidate option is correct or not doesn't capture this. This raises a slight concern about the positioning of the multiple-answers task within SATA. • Further, one of the other important concerns regarding the curation of this dataset is that very little information is given on the annotators. What was their background? Did they have exper
* The paper is well-written and easy to understand. It is easy to see the motivation for the work, and the benchmark itself seems thoughtfully constructed. * The authors have considered a variety of evaluative methods. It feels comprehensive. * The authors test against a number of natural baselines. * The benchmark has two qualities that I think are important for a benchmark: (1) there is room for improvement (but it's not too hard), and (2) it provides discriminative insight across models.
While this is a well-done evaluation of a simple idea, I have two main concerns. The first is just the significance of the work. While I appreciate the careful attention to detail and the reasonably well done empirical work, it seems like a fairly niche question. Frankly, I don't think the authors do a good job of building the case for the need for this benchmark. The second is that this paper seems to be missing an important set of related work, where this problem is termed "multi-label class
The benchmark is novel in its approach to evaluating LLMs in SATA tasks. The analysis and classification of bias (selection bias, unselection bias and count bias) in the execution of SATA tasks by LLMs are clear and well-founded. The process of constructing the dataset is detailed, covering data filtering, readability scoring, and multiple rounds of manual validation. The experimental design covers a wide range, including 18 closed-source models and 14 open-source models, all of which prove t
1. The Choice Funnel method relies on token probability and is therefore only applicable to open-source models with accessible probability distributions. 2. The method is similar to traditional greedy selection and early stopping mechanisms, with high similarity to existing selective prediction or multi-label output strategies such as probability thresholding and top-k selection. 3. There may be significant differences in the threshold parameter τ between different models and tasks. 4. This pape
* The paper studies an important problem with real-world implications (e.g., medical diagnosis where multiple conditions could be valid options). * The authors make a thorough, end-to-end contribution, spanning benchmark construction, evaluation and a deep-dive on failure modes, and proposing a candidate algorithm to mitigate some observed shortcomings. * SATA-BENCH clearly had a lot of time put into producing a high-quality resource, with the authors taking numerous steps to remove too short or
* Given the presentation of results in Table 2, it is hard to draw any meaningful conclusions about differences between models. In particular, there is no clear winner, and very little consistency in rank ordering of models between the different metrics. Perhaps the authors could consider a clearer visualization of what this table is intended to convey. * Relatedly, there are an awful lot of metrics deployed, and it's often unclear why multiple are required, given they appear to be different implement
Taxonomy
Topics: Multi-Criteria Decision Making
