UQ: Assessing Language Models on Unsolved Questions

Fan Nie; Ken Ziyu Liu; Zihao Wang; Rui Sun; Wei Liu; Weijia Shi; Huaxiu Yao; Linjun Zhang; Andrew Y. Ng; James Zou; Sanmi Koyejo; Yejin Choi; Percy Liang; Niklas Muennighoff

arXiv:2508.17580·cs.CL·August 26, 2025

TL;DR

This paper introduces UQ, a benchmark of 500 challenging, real-world unsolved questions drawn from diverse topics. It is designed to evaluate language models on difficult, realistic tasks, with answers checked through community verification, and aims to push AI capabilities forward.

Contribution

The paper presents UQ, a comprehensive platform with curated unsolved questions, validation strategies, and community verification to assess language models on real-world, open-ended challenges.

Findings

01. Top model passes only 15% of questions on UQ validation.

02. Preliminary human review found correct answers among those passing validation.

03. UQ provides a realistic, challenging benchmark for frontier language models.

Abstract

Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including…
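The screening flow described in the abstract (validators filter candidate answers first, and only the survivors reach community verification) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Candidate`, `screen`, and the two toy validators are hypothetical, and the rule that a candidate must pass every validator is one possible way to combine them, not a description of the actual UQ validators.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical container for a model's answer to one question.
@dataclass
class Candidate:
    question: str
    answer: str


# A validator is any function that accepts or rejects a candidate.
Validator = Callable[[Candidate], bool]


def screen(candidates: List[Candidate], validators: List[Validator]) -> List[Candidate]:
    """Keep only candidates that every validator accepts (assumed rule)."""
    return [c for c in candidates if all(v(c) for v in validators)]


# Toy stand-ins for validators; real ones would be far more involved.
def non_empty(c: Candidate) -> bool:
    return bool(c.answer.strip())


def shares_a_term(c: Candidate) -> bool:
    # Crude relevance check: the answer reuses a longer word from the question.
    q_terms = {w.lower() for w in c.question.split() if len(w) > 3}
    a_terms = {w.lower() for w in c.answer.split()}
    return bool(q_terms & a_terms)


question = "Placeholder question about planar graphs"
candidates = [
    Candidate(question, "A claim about planar graphs and their colorings"),
    Candidate(question, "   "),
    Candidate(question, "Unrelated text"),
]

# Survivors are the ones that would be queued for human verification.
queue_for_humans = screen(candidates, [non_empty, shares_a_term])
pass_rate = len(queue_for_humans) / len(candidates)
```

Passing the screen is not the same as being correct: in this sketch, as in the workflow the abstract describes, the screen only narrows down what human reviewers need to check.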

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

+ A significant strength is the creation of the UQ-Platform. The authors acknowledge that automated validation is insufficient for these complex, open-ended problems. By building an ecosystem for human experts to review, rate, and verify AI-generated answers, they create a sustainable, long-term evaluation method that can adapt as models improve.
+ The benchmark's design directly measures a model's ability to solve problems that are currently unsolved by humans. This is a powerful concept becau

Weaknesses

- The core premise, using unsolved questions, creates a fundamental verification bottleneck. Without ground truth, evaluation relies on the UQ-Validators (which are imperfect) and human experts. Sourcing and compensating a diverse pool of domain experts to verify a large volume of complex answers is costly, time-consuming, and not easily scalable.
- The UQ-Validator pipeline is intricate and depends on LLMs judging other LLMs, a process known to have biases. The paper's own results show the bes

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper explores alternatives to artificial exam-like benchmarks and tries to tackle the problem of benchmarks with highly difficult but unrealistic problems.
- The collection pipeline is well curated, with several types of filters and explicit filtering criteria such as well-posedness, difficulty etc.
- The fact that the benchmark is hosted on a live platform helps checking question quality, model answers and even provide solutions to problems
- Validators are useful to rule out wrong answe

Weaknesses

- The paper suggests that unsolved problems are “by construction” realistic. However, there are several unsolved problems that are really hard, but artificial. Conversely, not all difficult solved problems are inherently unrealistic and artificial. As an example, research problems are unsolved but still natural for the context. It’d be better to add some more evidence about the realism claim.
- Given the absence of ground truth answers and with validators being indicative but not conclusive (acc

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The motivation to design both difficult and realistic benchmarks, which provide challenging practical questions, is novel and meaningful
2. The paper proposes a reference-free validation method to evaluate the correctness of the LLM response on the unlabeled question-answer datapoints
3. Focusing on currently unsolved problems stimulates the usage of LLM in answering new research questions or exploring new research directions

Weaknesses

1. The reason behind the dataset creation criteria is unclear. How the rules set for the rule-based filter and LLM-based filter contribute to the difficulty and realism is not clearly illustrated.
2. Some processes use the LLM to provide the label. The reliability of the LLM on such tasks is unclear. For example, the “approachable: whether the question is logically sound and solvable in principle”, seems to be a challenging task for LLM to provide a reliable label.
3. The definition of unsolvab

Code & Models

Datasets

uq-project/uq
dataset· 67 dl
