TL;DR
This paper presents methodologies and an open-source toolkit, flor, for efficient hindsight logging in model training: developers add log statements after a run and replay them from checkpoints instead of retraining, which speeds up debugging and analysis at low overhead.
Contribution
The paper presents novel techniques for background logging, checkpointing, and automatic record-replay of model training, inspired by database recovery methods.
Findings
Checkpointing with flor adds around 7% overhead
Hindsight log replay is orders of magnitude faster than retraining from scratch
The toolkit is adaptable and easy to integrate into existing workflows
Abstract
In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible. Optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint -- a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for…
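The record-and-replay idea described in the abstract can be illustrated with a minimal Python sketch. It does not use flor or reproduce its API; the toy training step and the names `train_epoch`, `record`, and `replay` are hypothetical, chosen only for illustration. The first run checkpoints model and random-number state at the start of every epoch without logging gradients. Later, a log statement is added in hindsight and only the epoch of interest is re-executed from its checkpoint.

```python
import copy
import random


def train_epoch(state, rng):
    """One toy 'epoch': a noisy gradient step that pulls w toward 3.0."""
    grad = 2 * (state["w"] - 3.0) + rng.gauss(0, 0.1)
    state["w"] -= 0.1 * grad
    state["epoch"] += 1
    return grad


def record(epochs, seed=0):
    """Original run: no gradient logging, but a checkpoint at every epoch start."""
    rng = random.Random(seed)
    state = {"w": 0.0, "epoch": 0}
    checkpoints = {}
    for _ in range(epochs):
        # Save both model state and RNG state so the epoch can be re-executed exactly.
        checkpoints[state["epoch"]] = (copy.deepcopy(state), rng.getstate())
        train_epoch(state, rng)
    return state, checkpoints


def replay(checkpoints, epoch, log):
    """Hindsight run: restore one checkpoint and re-execute only that epoch,
    calling a log function that did not exist during the original run."""
    saved_state, rng_state = checkpoints[epoch]
    state = copy.deepcopy(saved_state)
    rng = random.Random()
    rng.setstate(rng_state)
    grad = train_epoch(state, rng)
    log(epoch, grad, state["w"])
    return state


final_state, checkpoints = record(epochs=5)

# Added after the fact: capture the gradient seen during epoch 3.
hindsight_log = []
replayed = replay(
    checkpoints,
    epoch=3,
    log=lambda e, g, w: hindsight_log.append((e, g, w)),
)
```

Because the random-number state is restored together with the model state, the replayed epoch ends in the same state the original run reached, so the hindsight log describes what actually happened rather than a fresh run. In this sketch replay costs one epoch instead of a full retraining; real training state (optimizer, data loader position, GPU random state) would need to be captured in the same way.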
