TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time

Thales Sales Almeida; Giovana Kerche Bonás; João Guilherme Alves Santos; Hugo Abonizio; Rodrigo Nogueira

arXiv:2501.07482·cs.CL·May 21, 2025

TL;DR

This paper introduces TiEBe, a comprehensive benchmark dataset to evaluate how well large language models recall notable global and regional events over time, revealing disparities based on geography, language, and socioeconomic factors.

Contribution

The paper presents TiEBe, a novel dataset for assessing LLMs' temporal and regional factual recall, highlighting geographic and language-based disparities in model knowledge.

Findings

01

Significant geographic disparities in factual recall.

02

High correlation between model performance and socioeconomic indicators.

03

Performance gaps for low-resource languages.

Abstract

As the knowledge landscape evolves and large language models (LLMs) become increasingly widespread, there is a growing need to keep these models updated with current events. While existing benchmarks assess general factual recall, few studies explore how LLMs retain knowledge over time or across different regions. To address these gaps, we present the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs centered on notable global and regional events, spanning more than 10 years of events, 23 regions, and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to identify notable events through time. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments, grounded in factual evidence beyond Wikipedia itself. Our results reveal significant geographic disparities in factual…
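A benchmark organized by region and year lends itself to a simple per-group accuracy breakdown. The following is a minimal sketch of that aggregation, not the authors' evaluation code: the record layout (`region`, `year`, `correct`), the helper name `accuracy_by_group`, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of records judged correct} for the given key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(rec["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical judged answers; the real dataset fields may differ.
records = [
    {"region": "Brazil", "year": 2019, "correct": True},
    {"region": "Brazil", "year": 2023, "correct": False},
    {"region": "World", "year": 2019, "correct": True},
    {"region": "World", "year": 2023, "correct": True},
]

by_region = accuracy_by_group(records, "region")
by_year = accuracy_by_group(records, "year")
print(by_region)
print(by_year)
```

Grouping by `region` exposes geographic gaps, while grouping by `year` shows how recall changes over time; the same helper serves both views because the grouping key is a parameter.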

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

This work presents a large dataset of questions about events from different countries and times. One distinction from prior works establishing similar datasets is the inclusion of questions in different languages reflecting the country the question pertains to.

Weaknesses

1. Limited discussion of related work, both in the related work section and in the paper as a whole. In particular, lots of prior work has explored the impact of temporal or geographical context on Wikipedia-based factoid QA [1]. Other works have looked at both individually [2, 3]. Additionally, other works have developed methods for synthetically identifying such temporally dependent QA pairs or facts from Wikipedia pages or related resources like Wikidata [4, 5]. Other works have loo

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper is grounded in strong motivation and tackles an important and relevant problem. Building a benchmark to evaluate time-sensitive world knowledge in LLMs fills a major gap in current evaluation practices, making this contribution meaningful and well-justified.

Weaknesses

1. Unclear justification for using only DeepSeek-V3: The rationale behind selecting DeepSeek-V3 as the sole model for generating questions, translations, and evaluations is not well supported. Although the authors claim the model performs adequately, their own results (Table 2) show that GPT-4o aligns more closely with human judgments. Relying exclusively on one model—especially a less optimal one—seems unjustified. A more credible design would validate outputs across multiple models or through

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The paper presents a scalable method for benchmark creation by using Wikipedia retrospective pages to identify notable events, ensuring a structured and continuously updatable data source for temporal analysis.
2. Experimental evaluation is extensive, testing nine different open-source and commercial LLMs. This provides a broad and representative assessment of current model capabilities and their shared limitations.

Weaknesses

About Method

1. The paper only validates LLM-as-judge reliability but lacks human evaluation of the generated QA pairs themselves. LLM-generated questions may contain factual errors or inconsistencies. Recommend conducting human evaluation on a sample to assess quality metrics like factual consistency and answerability.
2. GPT-4o achieved 91% consistency versus DeepSeek-V3's 88.5%, yet the paper chose the lower-performing DeepSeek-V3 as judge without explanation (cost, API availability, et

Code & Models

Repositories

timelyeventsbenchmark/tiebe
Official

Taxonomy

Topics: Semantic Web and Ontologies · Advanced Text Analysis Techniques