Evaluating LLMs with Multiple Problems at once
Zhengxiang Wang, Jordan Kodner, Owen Rambow

TL;DR
This paper advocates for multi-problem evaluation (MPE) of LLMs, introduces the ZeMPE benchmark with 53,100 prompts, and systematically assesses 13 models, revealing strengths and limitations in handling multiple problems simultaneously.
Contribution
It introduces the ZeMPE benchmark for zero-shot multi-problem evaluation and provides a comprehensive analysis of LLMs' capabilities in handling multiple problems at once.
Findings
LLMs handle multiple problems drawn from a single data source as well as they handle the same problems presented separately.
Performance varies under different conditions.
Model-level factors influence multi-problem handling.
Abstract
This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all these problems in a single output. Leveraging 6 classification and 12 reasoning benchmarks that already exist, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We experiment with a total of 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs are capable of handling multiple problems from a single data source as well as handling them separately, but there are conditions this multiple…
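To make the contrast with single-problem evaluation concrete, here is a minimal Python sketch of the general idea: several problems are numbered under one shared instruction, and the single model output is split back into per-problem answers for scoring. This is an illustration only, not the ZeMPE prompt format or the authors' code. The function names, the numbered-line answer convention, and the exact-match scoring are all assumptions made for the example.

```python
import re


def build_multi_problem_prompt(instruction, problems):
    """Place several problems under one shared instruction (hypothetical format)."""
    lines = [instruction, ""]
    for i, problem in enumerate(problems, start=1):
        lines.append(f"{i}. {problem}")
    lines.append("")
    lines.append("Answer every problem, one per line, as '<number>. <answer>'.")
    return "\n".join(lines)


def parse_numbered_answers(output, n_problems):
    """Map problem number -> answer text; unanswered problems map to None."""
    answers = {i: None for i in range(1, n_problems + 1)}
    pattern = r"^[ \t]*(\d+)\.[ \t]*(.+?)[ \t]*$"
    for match in re.finditer(pattern, output, flags=re.MULTILINE):
        idx = int(match.group(1))
        # Keep only the first answer given for each known problem number.
        if idx in answers and answers[idx] is None:
            answers[idx] = match.group(2)
    return answers


def per_problem_accuracy(answers, gold):
    """Fraction of problems whose answer matches the gold label (case-insensitive)."""
    correct = 0
    for i, label in enumerate(gold, start=1):
        answer = answers.get(i)
        if answer is not None and answer.lower() == label.lower():
            correct += 1
    return correct / len(gold)


# Usage with a made-up model output (no real model is called here).
problems = ["The movie was great.", "Terrible service."]
prompt = build_multi_problem_prompt(
    "Classify the sentiment of each text as positive or negative.", problems
)
model_output = "1. positive\n2. negative"
answers = parse_numbered_answers(model_output, len(problems))
accuracy = per_problem_accuracy(answers, ["positive", "positive"])
```

Scoring each problem separately, rather than the output as a whole, lets a multi-problem run be compared directly with the same problems asked one prompt at a time.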
Taxonomy
Topics: Library Science and Information Systems
