Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Sahil Kale

TL;DR
This paper evaluates how well modern large language models utilize integrated web search to improve factual accuracy, revealing strengths in static knowledge and challenges in dynamic, real-time information retrieval.
Contribution
Introduces a benchmark for assessing the necessity and effectiveness of web search in LLMs, highlighting current limitations and potential improvements.
Findings
Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens
Models frequently invoke search on dynamic queries, yet their accuracy stays low
Overconfidence and retrieval failures limit effectiveness
Abstract
Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent…
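To make the quantities in the abstract concrete, here is a minimal sketch of how per-split accuracy, search-invocation rate, and a calibration gap could be computed from logged model responses. It is not the paper's evaluation code. The `Record` fields, the `summarize` helper, and the choice of expected calibration error (ECE) as the calibration measure are all assumptions made for illustration; the abstract does not state which metrics or data format the benchmark uses.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One logged model response (hypothetical schema, not from the paper)."""
    split: str         # "static" or "dynamic"
    used_search: bool  # whether the model invoked web search
    correct: bool      # whether the final answer was judged correct
    confidence: float  # model-stated confidence in [0, 1]


def summarize(records, split, n_bins=10):
    """Return accuracy, search rate, and ECE for one split.

    ECE is an assumed calibration measure: the weighted mean of
    |average confidence - average accuracy| over equal-width bins.
    """
    rows = [r for r in records if r.split == split]
    if not rows:
        raise ValueError(f"no records for split {split!r}")
    n = len(rows)

    accuracy = sum(r.correct for r in rows) / n
    search_rate = sum(r.used_search for r in rows) / n

    # Group responses into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for r in rows:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)

    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(r.confidence for r in b) / len(b)
            avg_acc = sum(r.correct for r in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - avg_acc)

    return {"n": n, "accuracy": accuracy, "search_rate": search_rate, "ece": ece}
```

Running `summarize` once with web access enabled and once without, on the same split, would give the kind of comparison the abstract describes: whether accuracy rises with search while the calibration gap widens.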
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
