Technical Report: Developing a Working Data Hub
Vijay Gadepally, Jeremy Kepner

TL;DR
This report discusses the essential features, challenges, and best practices for developing a functional data hub to manage and access organizational data efficiently, supporting AI and machine learning initiatives.
Contribution
It provides a comprehensive overview of data hub components, challenges, and practical guidelines for implementation in enterprise settings.
Findings
Identifies key characteristics of data hubs.
Highlights challenges in data management and access control.
Recommends best practices for deployment and operation.
Abstract
Data forms a key component of any enterprise. The need for high-quality, easily accessible data is further amplified by organizations wishing to leverage machine learning or artificial intelligence for their operations. To this end, many organizations are building resources for managing heterogeneous data, providing end-users with an organization-wide view of available data, and acting as a centralized repository for data owned or collected by an organization. Very broadly, we refer to this class of techniques as a "data hub." While there is no clear definition of what constitutes a data hub, some of the key characteristics include: a data catalog; links to data sets, to owners of data sets, or to a centralized data repository; a basic ability to serve and visualize data sets; access control policies that ensure secure data access and respect the policies of data owners; and computing capabilities tied…
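To make the listed characteristics concrete, the following is a minimal sketch of a catalog with owner links and a group-based access check. It is not taken from the report: the names `CatalogEntry`, `DataHub`, `allowed_groups`, and the example location are all hypothetical, and a real hub would delegate access control to the organization's identity system.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set listed in the hub's catalog (hypothetical schema)."""
    name: str
    owner: str       # who to contact about the data set
    location: str    # link to the data, or its path in a central repository
    allowed_groups: set[str] = field(default_factory=set)  # owner's access policy


class DataHub:
    """Toy in-memory catalog; an assumption for illustration only."""

    def __init__(self) -> None:
        self._catalog: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Add or overwrite the catalog record for this data set.
        self._catalog[entry.name] = entry

    def list_datasets(self) -> list[str]:
        # Organization-wide view: every registered name is visible.
        return sorted(self._catalog)

    def resolve(self, name: str, user_groups: set[str]) -> str:
        # Return the location only if the user shares a group with the policy.
        entry = self._catalog[name]
        if not (entry.allowed_groups & user_groups):
            raise PermissionError(
                f"access to {name!r} denied; contact {entry.owner}"
            )
        return entry.location
```

In this sketch the catalog listing is open to everyone while the data location is gated, which mirrors the split between an organization-wide view and owner-defined access policies.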
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Data Quality and Management
