LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News
Yunfan Zhang, Kathleen McKeown, Smaranda Muresan

TL;DR
LiveNewsBench is a new benchmark that automatically generates and updates challenging news-based questions to evaluate large language models' ability to perform agentic web search and real-time information retrieval.
Contribution
The paper introduces LiveNewsBench, a regularly updated benchmark with automated question generation from recent news, enabling robust evaluation of LLMs' web search capabilities and supporting dataset creation.
Findings
LLMs vary significantly in web search performance.
The benchmark reveals strengths and weaknesses of different models.
Automated data curation facilitates large-scale evaluation.
Abstract
Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce LiveNewsBench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. LiveNewsBench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a…
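The freshness requirement described in the abstract, that questions should depend on information from after a model's training data ends, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `QAPair` class, the `select_fresh` function, the placeholder questions, and the cutoff date are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QAPair:
    """Hypothetical record for one benchmark question."""
    question: str
    answer: str
    event_date: date  # date of the news event the question is based on


def select_fresh(pairs, training_cutoff):
    """Keep only QA pairs whose source event is strictly after the cutoff."""
    return [p for p in pairs if p.event_date > training_cutoff]


# Placeholder data; not taken from the benchmark.
pairs = [
    QAPair("Hypothetical question A?", "answer A", date(2024, 11, 3)),
    QAPair("Hypothetical question B?", "answer B", date(2025, 6, 18)),
]

cutoff = date(2025, 1, 1)  # assumed training cutoff for some model
fresh = select_fresh(pairs, cutoff)
```

Under these assumptions, only questions tied to events after the cutoff remain, so a correct answer has to come from search rather than from the model's internal knowledge.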
Peer Reviews
Decision: Submitted to ICLR 2026
- The motivation for the benchmark is clear: there is a need to better evaluate search-enabled models on events beyond their training.
- The approach of combining question generation, validation, and human verification seems likely to produce quality QA pairs (though at the moment this can't be completely validated -- see weaknesses below).
- The paper is generally well-written.
1) It is not clear to me that the paper makes a sufficient research contribution beyond prior work on "live" QA benchmarks. Because it requires human validation, the benchmark cannot be updated arbitrarily frequently. The authors commit to updating it quarterly for 2 years. This seems similar to prior efforts, such as FreshQA and RealTimeQA. The paper states that RealTimeQA is not "continuously updated"; but if my understanding is correct, it was continuously updated (weekly) for some t
- The paper is well written and easy to understand.
- Benchmark construction steps are clearly documented. QA verification is conducted with both multiple models and humans, which keeps the quality of the QA pairs under control.
- Experiments with search-enabled LLMs demonstrate the usefulness of the proposed benchmark.
- The novelty of the work is limited, given prior research in this direction, such as FreshQA [1] and Daily Oracle [2].
- The news fetching pipeline is indirect and could accumulate errors. The paper takes a backward approach, going from Wikipedia events to candidate URLs and then to the actual news articles. Errors could compound across these matching steps. Why not fetch the news articles directly to generate the QA pairs?
- The benchmark construction relies on heavy usage of different LLMs,
- The paper identifies and mitigates a major flaw in existing web-search benchmarks by using recent news events that occur after model training cutoffs, which ensures minimal contamination from pretraining data.
- The benchmark's fully automated pipeline produces large quantities of question-answer pairs with limited human input while maintaining reasonable quality.
- The benchmark is designed to refresh quarterly, which helps maintain relevance and tests models on genuinely new information.
- T
- The paper overstates the limitations of static benchmarks. Static questions remain informative when their correct answers have changed since a model's training cutoff, as they still require retrieval rather than recall. The authors have not convincingly shown that they need to keep generating new benchmarks every few months -- a well-designed static benchmark with time-sensitive questions could already achieve the same goal.
- The benchmark conflates retrieval ability with reasoning accuracy.
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems
