Time and the Value of Data
Ehsan Valavi, Joel Hestness, Newsha Ardalani, Marco Iansiti

TL;DR
This paper explores how data relevance diminishes over time, affecting the optimal data collection strategy and the value of data in machine learning, emphasizing the importance of recent data for accuracy and competitive advantage.
Contribution
It introduces a theoretical framework showing the impact of data aging on model accuracy and business value, supported by empirical evidence from a next-word prediction experiment.
Findings
Adding older datasets to the training stock can reduce model accuracy.
Recent data provides more value than older data for machine learning.
The value of data declines significantly after several years.
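The trade-off behind these findings can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes a quantity whose true mean drifts linearly by a hypothetical `drift_per_step` each period, observed with independent noise of variance `noise_var`, and estimated by averaging the `k` most recent observations. Older observations lower the variance of the estimate but add bias, so the error is minimized at a finite window whenever the drift is nonzero.

```python
def window_mse(k, drift_per_step, noise_var):
    """Mean squared error of estimating the current mean from the k most
    recent observations, under the toy linear-drift assumption.

    An observation of age j is centered at (current mean - drift * j), so
    averaging ages 0..k-1 gives a bias of drift * (k - 1) / 2, while the
    noise variance of the average is noise_var / k.
    """
    bias = drift_per_step * (k - 1) / 2.0
    return bias ** 2 + noise_var / k


def best_window(max_k, drift_per_step, noise_var):
    """Window size in 1..max_k with the smallest toy MSE."""
    return min(
        range(1, max_k + 1),
        key=lambda k: window_mse(k, drift_per_step, noise_var),
    )


if __name__ == "__main__":
    # With drift, a short recent window beats the full history.
    print(best_window(100, drift_per_step=0.1, noise_var=1.0))
    # Without drift, more data always helps, so the full history wins.
    print(best_window(100, drift_per_step=0.0, noise_var=1.0))
```

Under these assumed numbers the best window is 6 observations out of 100 available, and using all 100 gives a far larger error than using only the most recent 6. With zero drift the best window is the full history, which matches the usual intuition that more data helps when relevance does not decay.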
Abstract
Managers often believe that collecting more data will continually improve the accuracy of their machine learning models. However, we argue in this paper that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data. In addition, we argue that increasing the stock of data by including older datasets may, in fact, damage the model's accuracy. Expectedly, the model's accuracy improves by increasing the flow of data (defined as data collection rate); however, it requires other tradeoffs in terms of refreshing or retraining machine learning models more frequently. Using these results, we investigate how the business value created by machine learning models scales with data and when the stock of data establishes a sustainable competitive advantage. We argue that data's…
Taxonomy
Topics: Stock Market Forecasting Methods · Digital Platforms and Economics · Complex Systems and Time Series Analysis
