Enterprise Data Science Platform: A Unified Architecture for Federated Data Access
Ryoto Miyamoto, Akira Kasuga

TL;DR
This paper introduces the Enterprise Data Science Platform (EDSP), a unified architecture based on data lakehouse principles that enables federated data access across multiple analytical environments, reducing data duplication and operational complexity.
Contribution
The paper presents EDSP, a novel multi-layer architecture that facilitates federated data access and interoperability while reducing data duplication in multi-query engine environments.
Findings
EDSP reduces operational steps by 33-44% compared to traditional data migration approaches.
Major cloud data warehouses and programming environments can directly query EDSP datasets.
End-to-end query response times remain practical, within seconds, despite increased latency.
Abstract
Organizations struggle to share data across departments that have adopted different data analytics platforms. If n datasets must serve m environments, up to n*m replicas can emerge, increasing inconsistency and cost. Traditional warehouses copy data into vendor-specific stores; cross-platform access is hard. This study proposes the Enterprise Data Science Platform (EDSP), which builds on data lakehouse architecture and follows a Write-Once, Read-Anywhere principle. EDSP enables federated data access for multi-query engine environments, targeting data science workloads with periodic data updates and query response times ranging from seconds to minutes. By providing centralized data management with federated access from multiple query engines to the same data sources, EDSP eliminates data duplication and vendor lock-in inherent in traditional data warehouses. The platform employs a…
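The replica bound stated in the abstract can be illustrated with a small arithmetic sketch. The function name `replica_count` and the `shared_storage` flag are hypothetical, introduced only for illustration; they are not part of EDSP. The sketch assumes the worst case, in which every environment keeps its own copy of every dataset, and compares it with a Write-Once, Read-Anywhere layout in which each dataset is stored once.

```python
def replica_count(n_datasets: int, m_environments: int, shared_storage: bool) -> int:
    """Upper bound on stored dataset copies (illustrative helper, not an EDSP API)."""
    if n_datasets < 0 or m_environments < 0:
        raise ValueError("counts must be non-negative")
    if shared_storage:
        # Write-Once, Read-Anywhere: one stored copy per dataset,
        # regardless of how many environments query it.
        return n_datasets
    # Worst case for per-environment copies: every environment
    # holds its own replica of every dataset.
    return n_datasets * m_environments


# Hypothetical example values: 10 datasets served to 4 environments.
copied = replica_count(10, 4, shared_storage=False)
shared = replica_count(10, 4, shared_storage=True)
print(copied, shared)
```

The gap between the two counts grows linearly with the number of environments, which is the duplication the abstract attributes to vendor-specific stores.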
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Distributed systems and fault tolerance
