Benchmarking Real-Time Question Answering via Executable Code Workflows
Wenjie Zhou, Yuan Gao, Xin Zhou, Hao Fu, Zhongjian Miao, Wei Chen, Bo Chen, Xiaobing Zhao

TL;DR
This paper introduces RT-QA, a dynamic benchmarking framework for real-time question answering that uses executable code workflows to evaluate and improve agents' ability to retrieve up-to-date information.
Contribution
It presents a novel evaluation framework with web crawling and self-repair mechanisms, highlighting the limitations of current models in real-time adaptability.
Findings
State-of-the-art models achieve only 46% accuracy in real-time QA.
Major failure modes include lazy retrieval and temporal confusion.
Existing models struggle with dynamic, time-sensitive information retrieval.
Abstract
Retrieving real-time information is a fundamental capability for search-integrated agents in real-world applications. However, existing benchmarks are predominantly static and therefore fail to capture the temporal dynamics of information and the continuously evolving nature of real-world knowledge. To address this limitation, we propose RT-QA, a dynamic evaluation framework that leverages executable code workflows to retrieve up-to-date answers at evaluation time. Specifically, we construct an agent-driven pipeline that autonomously generates code for web crawling and DOM-based answer extraction to produce real-time ground truth. To ensure robust evaluation over time, the pipeline further incorporates a self-repair mechanism to adapt to changes in web page structures. RT-QA spans 12 domains (e.g., Finance, Sports) with 320 Chinese questions categorized into three difficulty levels.…
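The extract-then-repair idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: RT-QA uses an agent to generate and regenerate crawling code, whereas the sketch below swaps in a fixed list of fallback element ids as a stand-in for the repair step, and it parses an in-memory HTML string rather than crawling a live page. All names (`IdTextExtractor`, `extract_by_id`, `answer_with_repair`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IdTextExtractor(HTMLParser):
    """Collects the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> carry no text and no nesting.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_by_id(html, element_id):
    """DOM-based extraction: return the element's text, or None if absent."""
    parser = IdTextExtractor(element_id)
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return text or None


def answer_with_repair(html, element_id, fallback_ids):
    """Try the stored id first; if the page layout changed, try fallbacks.

    Returns (answer_text, id_that_worked) so the caller can persist the
    repaired id. The fallback list is an assumption standing in for the
    agent-driven repair described in the abstract.
    """
    for candidate in [element_id, *fallback_ids]:
        text = extract_by_id(html, candidate)
        if text is not None:
            return text, candidate
    raise LookupError("no candidate id matched the page")


if __name__ == "__main__":
    old_page = '<div><span id="price">101.5</span></div>'
    new_page = '<div><span id="last-price">102.3<br> CNY</span></div>'
    print(answer_with_repair(old_page, "price", ["last-price"]))
    print(answer_with_repair(new_page, "price", ["last-price"]))
```

Returning the id that succeeded alongside the answer lets a caller store the repaired selector, so later evaluations start from the working one instead of repeating the failed lookup.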
