Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views
Katerina Margatina, Shuai Wang, Yogarshi Vyas, Neha Anna John, Yassine Benajiba, Miguel Ballesteros

TL;DR
This paper introduces a comprehensive framework for evaluating how well masked language models stay current with evolving factual knowledge over time, using dynamic, multi-granularity test sets derived from Wikidata.
Contribution
It presents a novel holistic framework that dynamically creates temporal test sets, constructs detailed splits, and evaluates MLMs from multiple perspectives to assess their robustness over time.
Findings
The framework enables evaluation at any time granularity (e.g. month, quarter, year).
Multi-view evaluation reveals how robust models are to factual updates.
The framework is used to benchmark 11 pretrained MLMs on temporal data.
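To make the idea of time granularity concrete, the following is a minimal sketch of how dated facts could be grouped into month, quarter, or year buckets. The function name `period_key` and the label formats are illustrative assumptions, not part of the paper's released code.

```python
from datetime import date


def period_key(d: date, granularity: str) -> str:
    """Map a date to a label for the time bucket it falls in.

    Hypothetical helper: the label formats are an assumption made
    for illustration only.
    """
    if granularity == "year":
        return f"{d.year}"
    if granularity == "quarter":
        # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4.
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if granularity == "month":
        return f"{d.year}-{d.month:02d}"
    raise ValueError(f"unsupported granularity: {granularity}")
```

Facts whose timestamps share a key would land in the same temporal test set, so switching granularity only changes the key function, not the rest of the pipeline.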
Abstract
Temporal concept drift refers to the problem of data changing over time. In NLP, that would entail that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is crucial that widely used language models remain up-to-date with the ever-evolving factual updates of the real world. Specifically, we provide a holistic framework that (1) dynamically creates temporal test sets of any time granularity (e.g. month, quarter, year) of factual data from Wikidata, (2) constructs fine-grained splits of tests (e.g. updated, new, unchanged facts) to ensure comprehensive analysis, and (3) evaluates MLMs in three distinct ways (single-token probing, multi-token…
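The fine-grained splits in step (2) can be illustrated with a small sketch that compares two snapshots of facts. This assumes each fact is keyed by a (subject, relation) pair mapped to a set of objects, and that "updated", "new", and "unchanged" are defined as in the comments below; the paper's exact definitions may differ, and `split_facts` is a hypothetical name.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation: (subject, relation) -> set of object values.
FactKey = Tuple[str, str]
Snapshot = Dict[FactKey, Set[str]]


def split_facts(old: Snapshot, new: Snapshot) -> Dict[str, List[FactKey]]:
    """Partition the facts of `new` relative to an earlier snapshot `old`.

    Assumed definitions (illustrative only):
      unchanged: key exists in both snapshots with identical objects
      updated:   key exists in both snapshots but the objects differ
      new:       key exists only in the later snapshot
    Facts that disappear from `new` are ignored in this sketch.
    """
    splits: Dict[str, List[FactKey]] = {"unchanged": [], "updated": [], "new": []}
    for key, objects in new.items():
        if key not in old:
            splits["new"].append(key)
        elif old[key] == objects:
            splits["unchanged"].append(key)
        else:
            splits["updated"].append(key)
    return splits
```

Evaluating a model separately on each split makes it possible to tell whether errors come from outdated knowledge (updated facts) or from facts the model could never have seen (new facts).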
Taxonomy
Topics: Data Stream Mining Techniques · Recommender Systems and Techniques · Caching and Content Delivery
Methods: Test
