Benchmark Datasets for Lead-Lag Forecasting on Social Platforms
Kimia Kazemian (1), Zhenzhen Liu (1), Yangfanyu Yang (2), Katie Z Luo (1), Shuhan Gu (1), Audrey Du (1), Xinyu Yang (2), Jack Jansons (1), Kilian Q Weinberger (1), John Thickstun (1), Yian Yin (2), Sarah Dean (1) ((1) Department of Computer Science, Cornell University (Ithaca

TL;DR
This paper introduces standardized benchmark datasets for Lead-Lag Forecasting (LLF) on social platforms, enabling systematic research into long-term, cross-domain temporal dependencies in user interaction data.
Contribution
It provides the first high-volume, multi-domain datasets for LLF, formalizes LLF as a new forecasting paradigm, and offers baseline evaluations to foster future research.
Findings
Confirmed presence of lead-lag dynamics in datasets
Benchmark datasets enable systematic LLF research
Baseline models establish initial performance metrics
Abstract
Social and collaborative platforms emit multivariate time-series traces in which early interactions (such as views, likes, or downloads) are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets, arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories), and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views -> edits), Spotify (streams -> concert…
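The LLF setup described in the abstract can be illustrated with a small sketch: observe the lead channel for a short window, then predict the cumulative lag outcome at a longer horizon. Everything below is an assumption made for illustration, not the paper's pipeline: the helper names (make_llf_pairs, fit_linear, predict_linear) are hypothetical, the data are synthetic rather than the arXiv or GitHub releases, and the log1p transform and window sizes are arbitrary choices. Linear regression is used only because the reviews mention it as a baseline.

```python
import numpy as np

def make_llf_pairs(lead, lag, observe_len, horizon):
    """Build (features, target) pairs for a toy LLF task.

    lead, lag: arrays of shape (n_items, n_steps) holding per-step counts.
    Features: log1p of the lead counts in the first `observe_len` steps.
    Target: log1p of the cumulative lag count over the first `horizon` steps.
    """
    X = np.log1p(lead[:, :observe_len])
    y = np.log1p(lag[:, :horizon].sum(axis=1))
    return X, y

def fit_linear(X, y):
    """Ordinary least squares with an intercept column appended."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return A @ coef

# Synthetic data (an assumption, not the benchmark): each item has a latent
# popularity; the lag channel echoes the lead channel after a fixed delay.
rng = np.random.default_rng(0)
n_items, n_steps, delay = 500, 60, 20
popularity = rng.lognormal(size=n_items)
lam = np.repeat(popularity[:, None] * 5.0, n_steps, axis=1)
lead = rng.poisson(lam)
lag = np.zeros_like(lead)
lag[:, delay:] = rng.poisson(0.3 * lead[:, :-delay])

# Observe 14 steps of the lead, predict the lag total over 60 steps.
X, y = make_llf_pairs(lead, lag, observe_len=14, horizon=60)
coef = fit_linear(X[:400], y[:400])   # fit on the first 400 items
pred = predict_linear(coef, X[400:])  # evaluate on the held-out 100
corr = np.corrcoef(pred, y[400:])[0, 1]
print(f"held-out correlation between predicted and true log lag: {corr:.2f}")
```

Note that the features here contain only the lead channel; whether early values of the lag channel should also be included as inputs is a design question the reviews raise, and this sketch does not take a position on it.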
Peer Reviews
Decision · Submitted to ICLR 2026
Clearly introduces and formalizes Lead-Lag Forecasting (LLF), which fills a significant gap in time-series research. Curates two valuable datasets (arXiv, GitHub) with careful preprocessing, long-range horizons, and minimal survivorship bias.
The work mostly benchmarks existing models and does not propose new LLF-specific architectures or techniques. Although included, the use of the Time-MoE foundation model does not yield significant gains, and its contribution appears limited or inconclusive.
S1. An interesting prediction question is proposed, and two large-scale datasets are provided. S2. The statistical analysis is provided to confirm the relations between the lead and the lag.
W1. I am not quite convinced by the problem setting. Why is the lag signal not included in the lead signals? The previous signals of the lag are very likely available in both the ArXiv and GitHub scenarios. With such inclusion, several related works have been previously proposed, such as [R1]. W2. The proposed datasets might be so simple that Linear Regression can achieve great performance. It casts doubt on whether the relation between the lead and the lag is complex enough for further discove
1. *Valuable and Novel Datasets*: The paper releases two large-scale, novel datasets for a well-defined and relevant forecasting problem. The scale (millions of papers and repositories) and the long-term nature seem very useful. 2. *Clear Problem Formulation*: The concept of Lead-Lag Forecasting is articulated clearly, providing a solid framework for future work. The paper is well-written and easy to follow. 3. *Comprehensive Benchmarking*: The authors have made a significant effort to benchmark
1. *Missing citation*: The paper fails to acknowledge that previous work has already investigated the influence of earlier accesses of arXiv papers on the citation count (Tim Brody, Stevan Harnad, Leslie Carr (2006): "*Earlier Web usage statistics as predictors of later citation impact*"). 2. *Limited new insights*: As a dataset paper, the primary contribution is not methodological, so it is not strictly required that the paper introduces any new modelling methodology. However, it would be useful
Taxonomy
Topics: Forecasting Techniques and Applications · Mobile Crowdsensing and Crowdsourcing · Complex Network Analysis Techniques
