PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
Yutao Wu, Xiao Liu, Yunhao Feng, Jiale Ding, Xingjun Ma

TL;DR
PaperAsk introduces a comprehensive benchmark to evaluate the reliability of large language models in scholarly tasks, revealing significant failure rates and developing classifiers to improve trustworthiness in research assistance.
Contribution
This work presents PaperAsk, a novel benchmark for systematic reliability evaluation of LLMs in paper search and reading, with new diagnostic tools and insights into failure modes.
Findings
Citation retrieval fails in 48-98% of multi-reference queries, and section-specific content extraction fails in 72-91% of cases.
LLMs often produce fabricated or incomplete information.
Reliability classifiers can identify unreliable outputs effectively.
Abstract
Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions, via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of…
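To make the paper-discovery numbers concrete, the following minimal sketch shows one common way a set-based F1 score can be computed for a single query. It is an illustration only: the function name `discovery_scores`, the title normalization, and the sample titles are assumptions, and the summary does not state how PaperAsk matches retrieved papers to the reference set.

```python
def discovery_scores(retrieved, relevant):
    """Return (precision, recall, f1) for one paper-discovery query.

    Hypothetical helper: titles are matched by case-insensitive exact
    comparison, which is an assumption, not the benchmark's stated method.
    """
    retrieved_set = {title.strip().lower() for title in retrieved}
    relevant_set = {title.strip().lower() for title in relevant}

    # Papers the model returned that are in the reference set.
    hits = len(retrieved_set & relevant_set)

    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: the model returns 4 papers, 2 of which are among
# the 5 relevant ones, so 3 of 5 relevant papers (60%) are missed.
retrieved = ["Paper A", "Paper B", "Paper X", "Paper Y"]
relevant = ["paper a", "paper b", "paper c", "paper d", "paper e"]
precision, recall, f1 = discovery_scores(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this definition, missing over 60% of the relevant literature corresponds to a recall below 0.4, and F1 stays low unless both precision and recall are high.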
