CaDrift: A Time-dependent Causal Generator of Drifting Data Streams
Eduardo V. L. Barboza, Jean Paul Barddal, Robert Sabourin, Rafael M. O. Cruz

TL;DR
CaDrift is a synthetic data generator based on causal models that creates evolving data streams with controlled shifts, useful for evaluating machine learning methods under changing data distributions.
Contribution
It introduces a novel framework for generating time-dependent, causally consistent data streams with controllable distributional shifts and perturbations.
Findings
Classifiers' accuracy drops after shifts and gradually recovers
CaDrift effectively simulates various types of distributional shifts
The framework is available on GitHub for research use
Abstract
This work presents Causal Drift Generator (CaDrift), a time-dependent synthetic data generator framework based on Structural Causal Models (SCMs). The framework produces a virtually infinite combination of data streams with controlled shift events and time-dependent data, making it a tool to evaluate methods under evolving data. CaDrift synthesizes various distributional and covariate shifts by drifting mapping functions of the SCM, which change underlying cause-and-effect relationships between features and the target. In addition, CaDrift models occasional perturbations by leveraging interventions in causal modeling. Experimental results show that, after distributional shift events, the accuracy of classifiers tends to drop, followed by a gradual recovery, confirming the generator's effectiveness in simulating shifts. The framework has been made available on GitHub.
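To make the two mechanisms in the abstract concrete (drifting a mapping function of the SCM, and modelling a perturbation as an intervention), here is a minimal sketch. It is not CaDrift's implementation or API: the three-node graph, the coefficients, and the names `make_stream`, `drift_at`, `intervene_at`, and `accuracy` are all hypothetical and chosen only for illustration.

```python
import random


def make_stream(n=2000, drift_at=1000, intervene_at=(500, 520), seed=0):
    """Yield (x1, x2, y) from a toy SCM with edges x1 -> x2, x1 -> y, x2 -> y.

    Graph, coefficients, and parameter names are illustrative assumptions,
    not taken from CaDrift.
    """
    rng = random.Random(seed)
    for t in range(n):
        # Root node: driven by exogenous noise only.
        x1 = rng.gauss(0.0, 1.0)
        if intervene_at[0] <= t < intervene_at[1]:
            # Perturbation as an intervention do(x2 = 3.0): the mechanism
            # x1 -> x2 is cut and x2 is set directly.
            x2 = 3.0
        else:
            x2 = 0.8 * x1 + rng.gauss(0.0, 0.5)
        # Mapping function of the target. Its weights change abruptly at
        # drift_at, altering the feature-to-target relationship.
        w1, w2 = (1.0, 1.0) if t < drift_at else (-1.0, 1.0)
        score = w1 * x1 + w2 * x2 + rng.gauss(0.0, 0.1)
        yield x1, x2, int(score > 0.0)


def accuracy(samples):
    """Accuracy of a fixed rule that matches the pre-drift mapping."""
    hits = sum(int(x1 + x2 > 0.0) == y for x1, x2, y in samples)
    return hits / len(samples)


stream = list(make_stream())
acc_before = accuracy(stream[:1000])  # rule agrees with the mapping here
acc_after = accuracy(stream[1000:])   # mapping has drifted; rule is stale
print(f"before drift: {acc_before:.3f}, after drift: {acc_after:.3f}")
```

The fixed rule never adapts, so this sketch reproduces only the accuracy drop after the shift. The gradual recovery reported in the abstract would additionally require a learner that keeps updating on the stream.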
Peer Reviews
Decision · Submitted to ICLR 2026
a. This paper claims to be the first drift generator that couples causal structure with explicit temporal dynamics in an SCM framework, going beyond IID or purely probabilistic generators.
b. The proposed method supports distributional, covariate, severe, and local drifts with abrupt, gradual, incremental, and recurrent patterns and configurable windows.
a. No quantitative alignment with real drifting datasets, and no transfer evidence that tuning on CaDrift improves real-world performance.
b. The evaluation focuses only on classification accuracy, with little coverage of regression or unsupervised settings, generation efficiency, or additional evaluation metrics.
- Addresses an important problem, which is well-grounded in the literature. Proper evaluation of concept drift adaptation/detection methods is a long-standing problem in the field. In this context, the paper has the potential to make an impactful contribution.
- Mostly well-written and easy to follow.
- Implementation of the proposed data generator is provided and seems somewhat easy to use. However, I do have a few suggestions on how to further improve it.
Table 2: I suggest including some notion of variance. Averages alone are not that meaningful, particularly for statistical significance analyses. Also, I suggest briefly describing the evaluated methods to make the paper more self-contained and more accessible. Right now, only TabPFN is described, and all other methods appear only in Table 10 in the appendix.
Section 5.2: I suggest starting this section by stating the purpose or objective of the evaluation that follows. It becomes
1. The work addresses a well-recognized need in the data stream community: the lack of versatile and controllable benchmark generators. The ability to simulate specific types of drifts, including those rooted in causal mechanisms, is a valuable goal.
2. The proposed framework is technically sound. The combination of SCMs for structured data generation, classic time-series components (EWMA, AR noise) for temporal correlation, and parameter modulation for drift is a logical and well-implemented a
1. The primary weakness is the limited novelty of the proposed method. CaDrift is essentially an engineering amalgamation of existing, well-known components: SCMs, EWMA, and AR models. While the integration is effective, it does not introduce new machine learning principles or a novel theoretical framework. The contribution feels more like a useful software tool than a fundamental research advancement.
2. The paper heavily emphasizes its "causal" foundation, yet the experimental design fails to
Taxonomy
Topics: Data Stream Mining Techniques · Time Series Analysis and Forecasting · Advanced Database Systems and Queries
