Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

Yutao Hou; Yajing Luo; Zhiwen Ruan; Hongru Wang; Weifeng Ge; Yun Chen; Guanhua Chen

arXiv:2411.10163 · cs.CL · January 30, 2026

Open Access

TL;DR

This paper introduces Compound-QA, a benchmark for evaluating large language models on compound questions, each composed of multiple interrelated sub-questions. It shows that LLMs struggle with such questions and proposes strategies for improvement.

Contribution

The paper presents Compound-QA, a benchmark for assessing LLMs on compound questions. It is built with a new synthesis method, Compound Question Synthesis (CQ-Syn), and spans five question categories.

Findings

01

LLMs perform significantly worse on compound questions than on simple ones.

02

The proposed strategies improve LLMs' understanding and reasoning on compound questions.

03

Compound-QA covers diverse question types and evaluation dimensions.

Abstract

Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, most benchmarks measure the ability of LLMs to respond to individual questions, neglecting the complex interactions found in real-world applications. We introduce Compound Question Synthesis (CQ-Syn) to build Compound-QA, a benchmark targeting questions composed of multiple interrelated sub-questions. The benchmark is derived from existing QA datasets, annotated with proprietary LLMs, and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates LLM capability along three dimensions: understanding, reasoning, and knowledge. Evaluating nine open-source LLMs on Compound-QA…
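
To make the benchmark's structure concrete, here is a minimal sketch of how a single compound question could be represented. The category and dimension names come from the abstract; everything else is an assumption made for illustration, including the `CompoundQuestion` class, its field names, the `as_prompt` helper, the idea that each item carries exactly one dimension, and the example sub-questions. None of it is taken from the actual dataset schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Category(Enum):
    """The five question categories named in the abstract."""
    FACTUAL_STATEMENT = "Factual-Statement"
    CAUSE_AND_EFFECT = "Cause-and-Effect"
    HYPOTHETICAL_ANALYSIS = "Hypothetical-Analysis"
    COMPARISON_AND_SELECTION = "Comparison-and-Selection"
    EVALUATION_AND_SUGGESTION = "Evaluation-and-Suggestion"


class Dimension(Enum):
    """The three evaluation dimensions named in the abstract."""
    UNDERSTANDING = "understanding"
    REASONING = "reasoning"
    KNOWLEDGE = "knowledge"


@dataclass
class CompoundQuestion:
    """Hypothetical record layout; the real dataset schema may differ."""
    category: Category
    dimension: Dimension  # assumption: one dimension per item
    sub_questions: List[str]

    def as_prompt(self) -> str:
        # Join the interrelated sub-questions into one compound question.
        return " ".join(q.strip() for q in self.sub_questions)


# Illustrative item only, not drawn from Compound-QA.
example = CompoundQuestion(
    category=Category.CAUSE_AND_EFFECT,
    dimension=Dimension.REASONING,
    sub_questions=[
        "What causes ocean tides?",
        "How do tides affect coastal ecosystems?",
    ],
)
```

Calling `example.as_prompt()` yields the two sub-questions as a single prompt, which is the form a model would be asked to answer in one turn.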

Taxonomy

Topics: Natural Language Processing Techniques