DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes

Jiehan Cheng; Zhicheng Dou

arXiv:2505.17162·cs.IR·May 26, 2025

TL;DR

DailyQA is a dynamic benchmark dataset that evaluates large language models' ability to process and answer questions based on rapidly changing web information, highlighting current challenges in handling real-world updates.

Contribution

We introduce DailyQA, an automatically updated dataset for evaluating LLMs on real-time web data, and analyze models' performance in processing time-sensitive information.

Findings

01

Web retrieval reranking is crucial for accuracy.

02

LLMs struggle with frequently updated data.

03

DailyQA offers insights into LLM progress on real-world tasks.

Abstract

We propose DailyQA, an automatically updated dynamic dataset that refreshes its questions weekly and provides the answer to each question as of any given date. DailyQA uses daily updates from Wikipedia revision logs to drive a fully automated pipeline of data filtering, query generation, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions that involve fast-changing factual data and cover multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that reranking web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that DailyQA…
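The abstract does not say how date-specific answers are stored or scored, so the following is only a minimal sketch of one way to picture "answers to questions on any given date": each question keeps a sorted history of (effective date, answer) pairs, and a lookup returns the latest answer in effect on the requested day. The names `answer_on` and `is_correct`, the normalized exact-match scoring, and the sample history are all hypothetical illustrations, not the authors' implementation or data.

```python
from bisect import bisect_right
from datetime import date


def answer_on(history, day):
    """Return the answer in effect on `day`.

    `history` is a list of (effective_date, answer) pairs sorted by date.
    Returns None if `day` is earlier than the first recorded answer.
    """
    dates = [effective for effective, _ in history]
    # Index of the first entry that takes effect strictly after `day`.
    idx = bisect_right(dates, day)
    return history[idx - 1][1] if idx else None


def is_correct(prediction, history, day):
    """Hypothetical scoring rule: case-insensitive exact match against
    the answer that was valid on `day`."""
    gold = answer_on(history, day)
    return gold is not None and prediction.strip().lower() == gold.strip().lower()


# Made-up history for a single question (illustrative only).
history = [
    (date(2025, 1, 1), "Alice"),
    (date(2025, 3, 10), "Bob"),
]

print(answer_on(history, date(2025, 2, 1)))
print(is_correct(" bob ", history, date(2025, 3, 10)))
```

Under this framing, the same model prediction can be right on one evaluation date and wrong on another, which is why a system that relies on stale retrieved pages would be penalized.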

Taxonomy

Methods: Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · Byte Pair Encoding · Softmax · Linear Layer · Dropout · Dense Connections · Attention Is All You Need