Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

Kimia Kazemian (1), Zhenzhen Liu (1), Yangfanyu Yang (2), Katie Z Luo (1), Shuhan Gu (1), Audrey Du (1), Xinyu Yang (2), Jack Jansons (1), Kilian Q Weinberger (1), John Thickstun (1), Yian Yin (2), Sarah Dean (1) ((1) Department of Computer Science, Cornell University (Ithaca, USA); (2) Department of Information Science, Cornell University (Ithaca, USA))

arXiv:2511.03877·cs.LG·November 7, 2025

TL;DR

This paper introduces standardized benchmark datasets for Lead-Lag Forecasting (LLF) on social platforms, enabling systematic research into long-term, cross-domain temporal dependencies in user interaction data.

Contribution

It provides the first high-volume, multi-domain datasets for LLF, formalizes LLF as a new forecasting paradigm, and offers baseline evaluations to foster future research.

Findings

01 Confirmed presence of lead-lag dynamics in datasets
02 Benchmark datasets enable systematic LLF research
03 Baseline models establish initial performance metrics

Abstract

Social and collaborative platforms emit multivariate time-series traces in which early interactions (such as views, likes, or downloads) are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets, arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories), and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views -> edits), Spotify (streams -> concert…
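The LLF setup described in the abstract (observe an early lead channel, predict a later lag outcome) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data is synthetic rather than drawn from the arXiv or GitHub benchmarks, the function names are hypothetical, and the log-linear least-squares model is a generic baseline, not the authors' implementation.

```python
import numpy as np


def make_synthetic_llf(n_items=500, lead_len=12, seed=0):
    """Build a toy LLF dataset (synthetic; NOT the paper's data).

    Each item has a hidden popularity that drives both its early lead
    counts (e.g. monthly accesses) and its later lag outcome
    (e.g. eventual citations).
    """
    rng = np.random.default_rng(seed)
    popularity = rng.lognormal(mean=0.0, sigma=1.0, size=n_items)
    # Lead channel: one row per item, one column per early time step.
    lead = rng.poisson(lam=20.0 * popularity[:, None], size=(n_items, lead_len))
    # Lag target: a single outcome count observed long after the lead window.
    lag = rng.poisson(lam=10.0 * popularity)
    return lead.astype(float), lag.astype(float)


def _design_matrix(lead):
    # Log-transform heavy-tailed counts and append an intercept column.
    return np.column_stack([np.log1p(lead), np.ones(len(lead))])


def fit_linear_baseline(lead, lag):
    """Least-squares fit of log1p(lag) on log1p(lead) plus an intercept."""
    coef, *_ = np.linalg.lstsq(_design_matrix(lead), np.log1p(lag), rcond=None)
    return coef


def predict_lag(coef, lead):
    """Map lead windows to predicted lag counts on the original scale."""
    return np.expm1(_design_matrix(lead) @ coef)


if __name__ == "__main__":
    lead, lag = make_synthetic_llf()
    # Hold out the last 100 items to check out-of-sample behaviour.
    coef = fit_linear_baseline(lead[:400], lag[:400])
    pred = predict_lag(coef, lead[400:])
    mae_log = np.mean(np.abs(np.log1p(np.clip(pred, 0, None)) - np.log1p(lag[400:])))
    print(f"held-out mean absolute error (log scale): {mae_log:.3f}")
```

The log transform is a design choice for heavy-tailed count data; on real benchmark data the lead window length, target horizon, and error metric would need to follow the dataset's own definitions.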

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 2

Strengths

Clearly introduces and formalizes Lead-Lag Forecasting (LLF), which fills a significant gap in time-series research. Curates two valuable datasets (arXiv, GitHub) with careful preprocessing, long-range horizons, and minimal survivorship bias.

Weaknesses

The work mostly benchmarks existing models and does not propose new LLF-specific architectures or techniques. Although included, the use of the Time-MoE foundation model does not yield significant gains, and its contribution appears limited or inconclusive.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

S1. An interesting prediction question is proposed, and two large-scale datasets are provided.
S2. The statistical analysis is provided to confirm the relations between the lead and the lag.

Weaknesses

W1. I am not quite convinced by the problem setting. Why is the lag signal not included in the lead signals? The previous signals of the lag are very likely available in both the ArXiv and GitHub scenarios. With such inclusion, several related works have been previously proposed, such as [R1].
W2. The proposed datasets might be so simple that Linear Regression can achieve great performance. It casts doubt on whether the relation between the lead and the lag is complex enough for further discove

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. *Valuable and Novel Datasets*: The paper releases two large-scale, novel datasets for a well-defined and relevant forecasting problem. The scale (millions of papers and repositories) and the long-term nature seem very useful.
2. *Clear Problem Formulation*: The concept of Lead-Lag Forecasting is articulated clearly, providing a solid framework for future work. The paper is well-written and easy to follow.
3. *Comprehensive Benchmarking*: The authors have made a significant effort to benchmark

Weaknesses

1. *Missing citation*: The paper fails to acknowledge that previous work has already investigated the influence of earlier accesses of arXiv papers on the citation count (Tim Brody, Stevan Harnad, Leslie Carr (2006): "*Earlier Web usage statistics as predictors of later citation impact*").
2. *Limited new insights*: As a dataset paper, the primary contribution is not methodological, so it is not strictly required that the paper introduces any new modelling methodology. However, it would be useful

Code & Models

Datasets

LeadLagForecasting/llf_github
dataset· 10 dl

Taxonomy

Topics: Forecasting Techniques and Applications · Mobile Crowdsensing and Crowdsourcing · Complex Network Analysis Techniques