Evaluating LLMs with Multiple Problems at once

Zhengxiang Wang; Jordan Kodner; Owen Rambow

arXiv:2406.10786·cs.AI·June 24, 2025



TL;DR

This paper advocates for multi-problem evaluation (MPE) of LLMs, introduces the ZeMPE benchmark with 53,100 prompts, and systematically assesses 13 models, revealing strengths and limitations in handling multiple problems simultaneously.

Contribution

It introduces the ZeMPE benchmark for zero-shot multi-problem evaluation and provides a comprehensive analysis of LLMs' capabilities in handling multiple problems at once.

Findings

1. LLMs can handle multiple problems from a single source.

2. Performance varies under different conditions.

3. Model-level factors influence multi-problem handling.

Abstract

This paper demonstrates the benefits of evaluating LLMs with multiple problems at once, a paradigm we call multi-problem evaluation (MPE). Unlike conventional single-problem evaluation, where a prompt presents a single problem and expects one specific answer, MPE places multiple problems together in a single prompt and assesses how well an LLM answers all of them in a single output. Leveraging 6 existing classification benchmarks and 12 existing reasoning benchmarks, we introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts. We evaluate 13 LLMs from 5 model families on ZeMPE to present a comprehensive and systematic MPE. Our results show that LLMs can handle multiple problems from a single data source as well as they handle those problems separately, but there are conditions this multiple…
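The contrast between single-problem and multi-problem prompts can be illustrated with a minimal sketch. This is not the ZeMPE prompt format or the authors' code; the function names, the numbered layout, and the `<number>: <answer>` reply format are all assumptions chosen for illustration.

```python
from typing import List, Optional


def build_multi_problem_prompt(task_instruction: str, problems: List[str]) -> str:
    """Place several problems in one prompt (hypothetical layout, not ZeMPE's)."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(problems, start=1))
    return (
        f"{task_instruction}\n\n"
        f"{numbered}\n\n"
        f"Answer all {len(problems)} problems, one per line, "
        f"in the form '<number>: <answer>'."
    )


def parse_answers(output: str, n_problems: int) -> List[Optional[str]]:
    """Read '<number>: <answer>' lines; problems with no usable line stay None."""
    answers: List[Optional[str]] = [None] * n_problems
    for line in output.splitlines():
        head, sep, tail = line.partition(":")
        head = head.strip()
        if not sep or not head.isdecimal():
            continue  # skip lines that do not follow the assumed format
        idx = int(head)
        if 1 <= idx <= n_problems and answers[idx - 1] is None:
            answers[idx - 1] = tail.strip()
    return answers


def per_problem_accuracy(predicted: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of problems answered correctly; a missing answer counts as wrong."""
    correct = sum(
        p is not None and p.lower() == g.lower() for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    prompt = build_multi_problem_prompt(
        "Classify the sentiment of each review as positive or negative.",
        ["Great film.", "Terrible plot.", "Loved the soundtrack."],
    )
    print(prompt)
    # A made-up model reply that skips the third problem.
    reply = "1: positive\n2: negative"
    print(per_problem_accuracy(parse_answers(reply, 3), ["positive", "negative", "positive"]))
```

Scoring each problem separately, rather than the output as a whole, keeps the result comparable to a single-problem baseline run over the same items.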


Code & Models

Repositories

jaaack-wang/multi-problem-eval-llm (official)


Taxonomy

Topics: Library Science and Information Systems