Are Synthetic Time-series Data Really not as Good as Real Data?
Fanzhe Fu, Junru Chen, Jing Zhang, Carl Yang, Lvbin Ma, Yang Yang

TL;DR
This paper introduces InfoBoost, a versatile framework for synthetic time-series data generation that enables training models without real data, surpassing the performance of models trained on real data and providing a universal feature extractor.
Contribution
The paper presents a novel synthetic data generation method and a universal feature extractor for time-series data, improving model training and generalization without relying on real data.
Findings
Synthetic data enables superior model reconstruction performance.
Universal feature extractor applies across diverse time-series data.
Method overcomes noise, rhythmic, and long-period feature challenges.
Abstract
Time-series data presents limitations stemming from data quality issues, bias and vulnerabilities, and generalization problems. Integrating universal data synthesis methods holds promise in improving generalization. However, current methods cannot guarantee that the generator's output covers all unseen real data. In this paper, we introduce InfoBoost -- a highly versatile cross-domain data synthesizing framework with time series representation learning capability. We have developed a method based on synthetic data that enables model training without the need for real data, surpassing the performance of models trained with real data. Additionally, we have trained a universal feature extractor based on our synthetic data that is applicable to all time-series data. Our approach overcomes interference from multi-source rhythmic signals, noise interference, and long-period features that…
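To make the idea of component-based synthesis concrete, here is a minimal sketch of a random generator that sums a rhythmic part, a trend part, and a noise part, echoing the "Ryth", "Noise", and "Trend" separation that the reviews below attribute to Figure 1. This is not the authors' InfoBoost algorithm: the function name `synth_series`, the parameter ranges, and the choice of sinusoids and Gaussian noise are all assumptions made for illustration.

```python
import math
import random


def synth_series(n=256, seed=0):
    """Illustrative only: build one series as rhythm + trend + noise.

    All names and parameter ranges here are hypothetical, not from the paper.
    """
    rng = random.Random(seed)

    # Rhythmic part: a few sinusoids with random frequency, amplitude, phase.
    waves = [
        (rng.uniform(2, 20), rng.uniform(0.5, 1.5), rng.uniform(0, 2 * math.pi))
        for _ in range(3)
    ]
    rhythm = [
        sum(a * math.sin(2 * math.pi * f * t / n + p) for f, a, p in waves)
        for t in range(n)
    ]

    # Trend part: a linear drift plus a sinusoid whose period is longer
    # than the sampling window (a "long-period" feature).
    slope = rng.uniform(-1, 1)
    long_period = rng.uniform(2, 5) * n
    trend = [
        slope * t / n + math.sin(2 * math.pi * t / long_period)
        for t in range(n)
    ]

    # Noise part: zero-mean Gaussian noise with a random scale.
    sigma = rng.uniform(0.05, 0.3)
    noise = [rng.gauss(0, sigma) for _ in range(n)]

    series = [r + d + e for r, d, e in zip(rhythm, trend, noise)]
    return series, {"rhythm": rhythm, "trend": trend, "noise": noise}
```

Calling `synth_series(n=128, seed=1)` returns the summed series together with its three components, so a model could in principle be trained to reconstruct or separate them. As one review notes, such components are not truly orthogonal in practice.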
Peer Reviews
Decision: Submitted to ICLR 2025
- **Variety of the datasets in the experiments**. The authors evaluate their method on a wide variety of datasets.
- **Writing.** The paper is hard to read and needs to be better structured. For instance, it would be helpful to provide pseudocode and a dedicated section that describes the proposed method step by step. Additionally, the design choices are not discussed. The paper contains many typos and requires significant polishing.
- **Methodological contribution.** The authors propose an algorithm for generating random time series signals without discussing any guarantees of such an algorithm.
It is good that they have suggested using a wide array of signals for the synthetic data generation.
Unfortunately, this is not novel enough for publication, although it can be built upon further. In my opinion, it is also not carefully thought out. For example, Figure 1 shows a diagrammatic separation of "Ryth", "Noise", and "Trend" that makes these feel, to an extent, like orthogonal constructors, but in practice they are not. "Ryth" has elements of noise and trend added into it, "Noise" has elements of Ryth and trend added into it implicitl
The goal is important: generating simulated data to aid learning when data are limited, and doing so in as task-agnostic a way as possible. The proposed method is tested in three different kinds of case studies.
The writing is unclear. Concepts are not introduced before being used, and the main learning setup and goal are not stated clearly anywhere. Some examples are in the Questions below. A couple of others: in Fig 7, it is not clear what the figure shows or what can be concluded from it. In Eqn 8, the notation has not been introduced. Overstated claims: While some aspects of the proposed method may be "non-DL" and "without the need for real data or data statistics", it is hard to see how generalizable learn
Taxonomy
Topics: Time Series Analysis and Forecasting
