Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

Blanca Calvo Figueras; Rodrigo Agerri

arXiv:2505.11341 · cs.CL · September 24, 2025

TL;DR

This paper introduces a large-scale dataset (~5K manually annotated questions) and automatic evaluation methods for Critical Questions Generation (CQs-Gen), the task of generating questions that expose the assumptions behind an argument and challenge its validity. It also benchmarks 11 large language models in a zero-shot setting.

Contribution

It provides the first extensive dataset and evaluation framework for CQs-Gen, enabling systematic benchmarking of models on this reasoning task.

Findings

01 Reference-based automatic evaluation is the strategy that best correlates with human judgments.

02 Zero-shot evaluation of 11 LLMs highlights the task's difficulty.

03 The benchmark results establish a baseline for future research.

Abstract

The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose underlying assumptions and challenge the validity of argumentative reasoning structures. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This paper presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale dataset including ~5K manually annotated questions. We also investigate automatic evaluation methods and propose reference-based techniques as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data and code plus a public leaderboard are provided to encourage further…
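
The reference-based evaluation mentioned in the abstract can be illustrated with a minimal sketch: a generated question is compared against manually annotated reference questions and inherits the label of the closest one, or is left unmatched when nothing is similar enough. Everything specific here is an assumption for illustration and not taken from the paper: the function names, the token-overlap (Jaccard) similarity, the 0.5 threshold, the label strings, and the example questions.

```python
import re

def tokens(text):
    # Lowercased word tokens; a deliberately simple stand-in for a real
    # semantic similarity model (assumption, not the paper's method).
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    # Token-overlap similarity in [0, 1].
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def label_by_reference(candidate, references, threshold=0.5):
    # references: list of (question_text, label) pairs annotated by humans.
    # Returns the label of the most similar reference, or "unmatched" when
    # the best similarity falls below the (hypothetical) threshold.
    best_label, best_score = None, 0.0
    for ref_text, ref_label in references:
        score = jaccard(candidate, ref_text)
        if score > best_score:
            best_label, best_score = ref_label, score
    if best_score < threshold:
        return "unmatched", best_score
    return best_label, best_score

# Hypothetical reference questions and labels, for illustration only.
references = [
    ("What evidence supports the claim that taxes will rise?", "useful"),
    ("Is the speaker a nice person?", "unhelpful"),
]

label, score = label_by_reference(
    "What evidence supports the claim that taxes will rise next year?",
    references,
)
other_label, other_score = label_by_reference("Why is the sky blue?", references)
```

In this sketch the first candidate shares most of its tokens with the first reference and takes its label, while the second candidate matches nothing closely and stays unmatched. Unmatched questions are the main weakness of any reference-based scheme, since a valid question that no annotator anticipated cannot be scored automatically.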

Code & Models

Datasets

HiTZ/CQs-Gen
dataset· 28 dl

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Multi-Agent Systems and Negotiation