Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen

TL;DR
This paper introduces Compound-QA, a new benchmark for evaluating large language models on complex, multi-part questions, revealing their challenges and proposing strategies for improvement.
Contribution
The paper presents Compound-QA, a novel benchmark for assessing LLMs on compound questions, including a new synthesis method and evaluation across multiple categories.
Findings
LLMs perform significantly worse on compound questions than on simple ones.
Strategies developed improve LLMs' understanding and reasoning on compound questions.
Compound-QA covers diverse question types and evaluation dimensions.
Abstract
Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, most benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. We introduce Compound Question Synthesis (CQ-Syn) to build Compound-QA, a benchmark targeting questions composed of multiple interrelated sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs, and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates the LLM capability in terms of three dimensions, including understanding, reasoning, and knowledge. Evaluating nine open-source LLMs on Compound-QA…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
