Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle
Rolando Garcia, Pragya Kallanagoudar, Chithra Anand, Sarah E. Chasins, Joseph M. Hellerstein, Erin Michelle Turner Kerrison, Aditya G. Parameswaran

TL;DR
FlorDB introduces an incremental, non-intrusive method for harvesting and querying metadata in machine learning pipelines, enhancing flexibility and agility without sacrificing discipline.
Contribution
The paper presents FlorDB, a system enabling post-hoc metadata collection and dynamic querying in ML workflows, bridging gaps between agile development and metadata management.
Findings
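Hindsight logging lets developers add and execute log statements post hoc, without requiring foresight.
Relational views over incomplete metadata can be queried to materialize new metadata in bulk and on demand, across multiple workflow versions.
The system supports diverse metadata types, covering both ad-hoc metadata and cases handled today by bespoke feature stores.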
Abstract
In this paper we present techniques to incrementally harvest and query arbitrary metadata from machine learning pipelines, without disrupting agile practices. We center our approach on the developer-favored technique for generating metadata -- log statements -- leveraging the fact that logging creates context. We show how hindsight logging allows such statements to be added and executed post-hoc, without requiring developer foresight. Relational views of incomplete metadata can be queried to dynamically materialize new metadata in bulk and on demand across multiple versions of workflows. This is done in a "metadata later" style, off the critical path of agile development. We realize these ideas in a system called FlorDB and demonstrate how the data context framework covers a range of both ad-hoc metadata as well as special cases treated today by bespoke feature stores and model…
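The idea of a relational view over incomplete log metadata can be illustrated with a small sketch. This is not FlorDB's API: the record layout, the helper names (`pivot`, `missing`, `backfill`), and the `replay_accuracy` function are all hypothetical, chosen only to show how a long-format log can be pivoted into a wide view, how gaps in that view identify metadata that was never logged, and how those gaps might be filled later in bulk.

```python
from collections import defaultdict

# Hypothetical long-format log store: (version, epoch, name, value).
# Version "v1" predates the "accuracy" log statement, so it has gaps.
records = [
    ("v1", 0, "loss", 0.9),
    ("v1", 1, "loss", 0.7),
    ("v2", 0, "loss", 0.8),
    ("v2", 0, "accuracy", 0.61),
    ("v2", 1, "loss", 0.5),
    ("v2", 1, "accuracy", 0.74),
]


def pivot(records, columns):
    """Pivot log records into one row per (version, epoch).

    Columns that were never logged for a row are set to None.
    """
    rows = defaultdict(dict)
    for version, epoch, name, value in records:
        rows[(version, epoch)][name] = value
    return [
        {"version": v, "epoch": e, **{c: vals.get(c) for c in columns}}
        for (v, e), vals in sorted(rows.items())
    ]


def missing(view, column):
    """Return the (version, epoch) keys where `column` has no value."""
    return [(r["version"], r["epoch"]) for r in view if r[column] is None]


def backfill(records, view, column, compute):
    """Return records extended with values for every missing cell.

    `compute` is a stand-in for re-running an old version with a
    newly added log statement; here it is just a Python callable.
    """
    new = [(v, e, column, compute(v, e)) for v, e in missing(view, column)]
    return records + new


def replay_accuracy(version, epoch):
    # Placeholder computation (an assumption for illustration only).
    return 0.5 + 0.1 * epoch


view = pivot(records, ["loss", "accuracy"])
gaps = missing(view, "accuracy")  # cells that were never logged
full_view = pivot(
    backfill(records, view, "accuracy", replay_accuracy),
    ["loss", "accuracy"],
)
```

In this sketch the view is queried first, and only the cells it reports as missing are computed, which mirrors the "metadata later" style described in the abstract: nothing needs to be logged up front for older versions.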
Taxonomy
Topics: Advanced Database Systems and Queries · Distributed and Parallel Computing Systems
