DynamicBench: Evaluating Real-Time Report Generation in Large Language Models
Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia

TL;DR
DynamicBench is a new benchmark for evaluating large language models' ability to generate real-time reports using up-to-date information retrieval and processing, addressing limitations of static evaluation methods.
Contribution
We introduce DynamicBench, a novel benchmark and a report generation system that assess LLMs' proficiency in handling dynamic, real-time data in specialized domains.
Findings
Our method surpasses GPT4o by 7.0% in document-free scenarios.
Achieves 5.8% higher performance than GPT4o in document-assisted scenarios.
Demonstrates state-of-the-art results in dynamic information synthesis.
Abstract
Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases. It necessitates domain-specific knowledge, ensuring accurate report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The paper is clearly written with informative figures and structure. 2. The topic is timely and highly relevant to current LLM capabilities (real-time information retrieval, and report generation). 3. The overall idea is easy to understand.
1. **Reliability of the accuracy evaluation** My main concern is that the proposed accuracy evaluation methodology lacks sufficient rigor. Several related issues are discussed below: a. **Error propagation in intermediate steps** The paper evaluates report accuracy by generating QA pairs from the input report (apparently using an LLM), then querying both a local report database and the web to verify answers. However, these intermediate steps also introduce potential noise. For e
It’s a fresh idea, benchmarking for dynamic info rather than static stuff. The way they combine web and database searching seems practical. Experiments are comprehensive, lots of different evaluation criteria. Results sound impressive, and they tackle a real gap in current LLM evaluation.
Hard to say how well this works for real niche domains (like legal/medical info)—could be limited by what’s on the web or in their databases. If the external data sources are bad or biased, the benchmarks could be off. I’m not sure how robust it is against bad info. Some models with longer outputs seemed less readable/comprehensible—curious if that’s a fixable issue or just a tradeoff. Even with rating guidelines, measures like “readability” and “applicability” still have some subjectivity. mayb
1. The topic itself (evaluating LLMs' ability to generate reports based on real-time / up-to-date information) is highly practical and relevant to real-world usage of LLMs, so this research direction is worth pursuing. 2. The benchmark includes data from a variety of categories including technology & science, economy & environment, culture & health, and international & politics. 3. Experiments are done with different LLMs and ablations.
1. Based on the contents of the paper, my major concern is that the design of DynamicBench does not really measure the "real-time" or "up-to-date" report generation ability in LLMs as the authors claimed. Take one of the query examples from Figure 1, "The growth and challenges of the global semiconductor industry from 2024 to 2025", this query will be outdated in the years after 2025. Also, the authors didn't provide details about how the input queries (i.e., the queries used to ask LLMs for gen
- The benchmark spans multiple specialized fields (Tech & Science, Economy & Environment, Culture & Health, International & Politics)
- The four-stage process (planning, search, writing, merging) ensures structured, accurate, and coherent outputs
- This paper emphasizes the importance of real-time data but relies heavily on the web and local databases. The local database is collected from AnnualReport, which is also not real-time data.
- Are you planning to release the benchmark?
- Despite the amazing performance in Figure 6, can you make sure that the comparison is fair? In your method, you use a multi-stage workflow including planning, search, writing, and merging. What about the other methods? Are they writing reports end-to-end? If so, you should sel
Taxonomy
Topics: Topic Modeling
