SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Thinh Pham; Nguyen Nguyen; Pratibha Zunjare; Weiyuan Chen; Yu-Min Tseng; Tu Vu

arXiv:2506.01062·cs.CL·April 10, 2026

TL;DR

SealQA is a new benchmark for evaluating the reasoning and factual accuracy of search-augmented language models in noisy, conflicting web search scenarios, revealing significant limitations of current models.

Contribution

The paper introduces SealQA, a challenge benchmark in three flavors (Seal-0, Seal-Hard, and LongSeal) for assessing factual accuracy and reasoning in search-augmented language models when web search returns conflicting, noisy, or unhelpful results.

Findings

01

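Frontier models perform poorly across all SealQA flavors: on Seal-0, agentic models such as o3 and o4-mini reach only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts.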

02

Increasing compute at test time does not significantly improve model performance.

03

Models struggle to identify relevant documents in long-context, multi-document settings.

Abstract

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as…

Code & Models

Repositories

github

Datasets

Videos

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models · slideslive