Version Control System for Data with MatrixOne
Hongshen Gou, Feng Tian, Long Wang, Nan Deng, Peng Xu

TL;DR
This paper introduces a scalable, git-like version control system for large datasets built into MatrixOne, enabling efficient diff, merge, and branching operations for data management.
Contribution
It presents a novel data version control system leveraging MatrixOne's architecture, supporting complete VCS operations at terabyte scale with high performance.
Findings
Supports clone, tag, branch, diff, merge, and revert operations.
Enables data engineering workflows with isolated development and atomic publishing.
Operates efficiently on terabyte-scale datasets with near-instantaneous performance.
Abstract
The rapid advancement of artificial intelligence has elevated data to a cornerstone of modern software systems. As data projects become increasingly complex and dynamic, version control for data has become essential rather than merely convenient. Existing version control systems designed for source code are inadequate for large-scale data management, as they often require loading entire datasets into memory for diff and merge operations. Database systems, while providing robust data management capabilities, lack native support for version control operations such as diff and merge between data forks. We present a version control system for data implemented in MatrixOne, a cloud-native relational database system. Our system leverages MatrixOne's immutable storage architecture and multi-version concurrency control (MVCC) to enable git-like operations on database tables at scale. The system…
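To make the diff and merge operations between data forks concrete, here is a minimal sketch of a row-level diff and a three-way merge over tables keyed by primary key. It is a toy illustration only: the in-memory dictionaries and the function names `diff_tables` and `three_way_merge` are assumptions made for this example, not MatrixOne's API or the paper's algorithm. The abstract states that the system relies on immutable storage and MVCC, and it criticizes approaches that load whole datasets into memory, so this sketch shows only the semantics of the operations.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy model: a table version is a dict of primary key -> row tuple.
Table = Dict[int, tuple]


def diff_tables(old: Table, new: Table):
    """Return rows added, removed, and changed when going from old to new."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed


def three_way_merge(base: Table, ours: Table, theirs: Table) -> Tuple[Table, List[int]]:
    """Merge two forks of base. None stands for 'row absent' in a version."""
    merged: Table = {}
    conflicts: List[int] = []
    for key in sorted(base.keys() | ours.keys() | theirs.keys()):
        b: Optional[tuple] = base.get(key)
        o: Optional[tuple] = ours.get(key)
        t: Optional[tuple] = theirs.get(key)
        if o == t:        # both forks agree (same edit, or both untouched)
            result = o
        elif o == b:      # only theirs changed this row
            result = t
        elif t == b:      # only ours changed this row
            result = o
        else:             # both changed it differently: record a conflict
            conflicts.append(key)
            result = o    # arbitrary choice for the sketch: keep ours
        if result is not None:
            merged[key] = result
    return merged, conflicts
```

In this sketch a row deleted in one fork and untouched in the other stays deleted after the merge, while a row edited differently in both forks is reported as a conflict rather than silently resolved.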
