CaDrift: A Time-dependent Causal Generator of Drifting Data Streams

Eduardo V. L. Barboza; Jean Paul Barddal; Robert Sabourin; Rafael M. O. Cruz

arXiv:2602.20329 · cs.LG · February 25, 2026

TL;DR

CaDrift is a synthetic data generator based on causal models that creates evolving data streams with controlled shifts, useful for evaluating machine learning methods under changing data distributions.

Contribution

It introduces a novel framework for generating time-dependent, causally consistent data streams with controllable distributional shifts and perturbations.

Findings

01. Classifiers' accuracy drops after shift events and then gradually recovers.
02. CaDrift effectively simulates various types of distributional shifts.
03. The framework is available on GitHub for research use.

Abstract

This work presents Causal Drift Generator (CaDrift), a time-dependent synthetic data generator framework based on Structural Causal Models (SCMs). The framework produces a virtually unlimited variety of data streams with controlled shift events and time-dependent data, making it a tool for evaluating methods under evolving data. CaDrift synthesizes various distributional and covariate shifts by drifting the mapping functions of the SCM, which changes the underlying cause-and-effect relationships between features and the target. In addition, CaDrift models occasional perturbations by leveraging interventions in causal modeling. Experimental results show that, after distributional shift events, the accuracy of classifiers tends to drop and then gradually recover, confirming the generator's effectiveness in simulating shifts. The framework has been made available on GitHub.
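The mechanism described in the abstract can be illustrated with a toy example. The sketch below is not CaDrift's code or API; it is a minimal, hypothetical three-node SCM (`x1 -> x2 -> y`, plus `x1 -> y`) written only with the Python standard library. The function names `generate_stream` and `static_rule_accuracy`, the coefficients, and the sign-flip drift are all assumptions chosen for illustration. It shows two ideas: a drift implemented by changing the mapping function from the parents to the target, and an occasional perturbation implemented as an intervention that overrides a node's value.

```python
import random


def generate_stream(n, drift_at, intervene_at=None, seed=0):
    """Sample n points from a toy SCM (hypothetical, not CaDrift's API).

    Graph: x1 -> x2 -> y and x1 -> y.
    """
    rng = random.Random(seed)
    intervene_at = intervene_at or set()
    samples = []
    for t in range(n):
        # Root node: driven only by exogenous noise.
        x1 = rng.gauss(0.0, 1.0)
        # Mapping function from x1 to x2, with additive noise.
        x2 = 0.8 * x1 + rng.gauss(0.0, 0.1)
        # Perturbation as an intervention do(x2 = 3.0): the value is set
        # directly, ignoring the parent x1 at this time step.
        if t in intervene_at:
            x2 = 3.0
        # Drift: the mapping from (x1, x2) to y changes at t = drift_at.
        # Here the sign of the decision rule flips (an abrupt drift).
        w = 1.0 if t < drift_at else -1.0
        y = int(w * (x1 + x2) > 0)
        samples.append((x1, x2, y))
    return samples


def static_rule_accuracy(samples):
    """Accuracy of a fixed rule that matches the pre-drift concept."""
    correct = sum(int(x1 + x2 > 0) == y for x1, x2, y in samples)
    return correct / len(samples)


stream = generate_stream(n=2000, drift_at=1000, intervene_at={500})
print("before drift:", static_rule_accuracy(stream[:1000]))
print("after drift: ", static_rule_accuracy(stream[1000:]))
```

A fixed rule that encodes the pre-drift concept agrees with the labels before the drift point and disagrees with them afterwards, which is the accuracy drop the abstract refers to. The gradual recovery would require a learner that adapts to the new mapping, which this sketch deliberately omits.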

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

a. This paper claims to be the first drift generator that couples causal structure with explicit temporal dynamics in an SCM framework, going beyond IID or purely probabilistic generators.
b. The proposed method supports distributional, covariate, severe, and local drifts with abrupt, gradual, incremental, and recurrent patterns and configurable windows.

Weaknesses

a. There is no quantitative alignment with real drifting datasets, and no transfer evidence that tuning on CaDrift improves real-world performance.
b. The evaluation focuses only on classification accuracy, with little coverage of regression or unsupervised settings, generation efficiency, or additional evaluation metrics.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Addresses an important problem, which is well-grounded in the literature. Proper evaluation of concept drift adaptation/detection methods is a long-standing problem in the field. In this context, the paper has the potential to make an impactful contribution.
- Mostly well-written and easy to follow.
- Implementation of the proposed data generator is provided and seems somewhat easy to use. However, I do have a few suggestions on how to further improve it.

Weaknesses

Table 2: I suggest including some notion of variance. Averages alone are not that meaningful, particularly for statistical significance analyses. I also suggest briefly describing the evaluated methods to make the paper more self-contained and accessible. Right now, only TabPFN is described, and all other methods appear only in Table 10 in the appendix.
Section 5.2: I suggest starting this section by stating the purpose or objective of the evaluation that follows. It becomes

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The work addresses a well-recognized need in the data stream community: the lack of versatile and controllable benchmark generators. The ability to simulate specific types of drifts, including those rooted in causal mechanisms, is a valuable goal.
2. The proposed framework is technically sound. The combination of SCMs for structured data generation, classic time-series components (EWMA, AR noise) for temporal correlation, and parameter modulation for drift is a logical and well-implemented a

Weaknesses

1. The primary weakness is the limited novelty of the proposed method. CaDrift is essentially an engineering amalgamation of existing, well-known components: SCMs, EWMA, and AR models. While the integration is effective, it does not introduce new machine learning principles or a novel theoretical framework. The contribution feels more like a useful software tool than a fundamental research advancement.
2. The paper heavily emphasizes its "causal" foundation, yet the experimental design fails to

Taxonomy

Topics: Data Stream Mining Techniques · Time Series Analysis and Forecasting · Advanced Database Systems and Queries