Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Yunxiang Yan, Tomohiro Sawada, Kartik Goyal

TL;DR
This paper introduces a cascaded question disclosure framework that evaluates large language models' problem-solving abilities stage by stage, giving a more accurate estimate of those abilities than standard QA benchmarks.
Contribution
It proposes a novel cascaded question disclosure method that offers a more precise and generalizable evaluation of LLMs' problem-solving capabilities, surpassing standard QA benchmarks.
Findings
Better comparison of LLMs' reasoning abilities
Induces more informative intermediate traces in models
Narrows performance gaps observed in standard evaluations
Abstract
While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models' problem-solving capabilities while maintaining scalability and automation. This approach collects model responses in a stagewise manner, with each stage revealing partial information about the question, designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes…
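The stagewise collection described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function name `cascaded_disclosure`, the way a question is split into pieces, the prompt wording, and the choice to feed earlier responses back into later stages. The stand-in `echo_model` replaces a real LLM call so the example runs on its own.

```python
from typing import Callable, List


def cascaded_disclosure(stages: List[str], model: Callable[[str], str]) -> List[str]:
    """Query `model` once per stage, revealing one more piece of the question each time.

    Hypothetical sketch: the stage split and prompt layout are assumptions,
    not the procedure defined in the paper.
    """
    responses: List[str] = []
    revealed: List[str] = []
    for piece in stages:
        revealed.append(piece)
        # The prompt holds everything revealed so far plus earlier responses.
        prompt_parts = ["Information so far:"] + revealed
        if responses:
            prompt_parts += ["Your earlier notes:"] + responses
        responses.append(model("\n".join(prompt_parts)))
    return responses


def echo_model(prompt: str) -> str:
    # Stand-in for an LLM call: reports how many prompt lines it received.
    return f"saw {len(prompt.splitlines())} lines"


# Hypothetical three-stage split of a single question.
stages = [
    "A train travels 120 km.",
    "It takes 2 hours.",
    "Question: what is its average speed?",
]
trace = cascaded_disclosure(stages, echo_model)
for stage_number, response in enumerate(trace, start=1):
    print(stage_number, response)
```

Swapping `echo_model` for a function that calls an actual model would yield one response per stage, which is the kind of intermediate trace the abstract refers to; how those traces are scored is not specified in the text above and is left out of the sketch.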
Taxonomy
Topics: Multi-Criteria Decision Making · Big Data and Business Intelligence
