Hindsight Logging for Model Training

Rolando Garcia; Eric Liu; Vikram Sreekanti; Bobby Yan; Anusha Dandamudi; Joseph E. Gonzalez; Joseph M. Hellerstein; Koushik Sen

arXiv:2006.07357 · cs.DC · December 3, 2020

TL;DR

This paper introduces methodologies and an open-source toolkit called flor for efficient hindsight logging in model training, enabling faster debugging and analysis with minimal overhead and resource consumption.

Contribution

The paper presents novel techniques for background logging, checkpointing, and automatic record-replay in model training, inspired by database recovery methods.
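To make the idea of background checkpointing concrete, here is a minimal sketch in plain Python. It is not flor's implementation or API: the `writer` and `train_with_background_checkpoints` functions, the file naming, and the use of a thread with a queue are all assumptions chosen for illustration. The point it shows is that the training loop only pays for an in-memory snapshot, while serialization to disk happens off the critical path.

```python
import copy
import os
import pickle
import queue
import tempfile
import threading


def writer(q, directory):
    """Background thread: write queued snapshots to disk until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        epoch, snapshot = item
        path = os.path.join(directory, f"epoch_{epoch}.pkl")  # hypothetical naming scheme
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)


def train_with_background_checkpoints(epochs, directory):
    """Toy training loop that hands one snapshot per epoch to the writer thread."""
    q = queue.Queue()
    t = threading.Thread(target=writer, args=(q, directory))
    t.start()
    state = {"w": 0.0}
    for epoch in range(epochs):
        state["w"] += 1.0  # stand-in for one epoch of real training
        # Snapshot in the foreground (cheap), serialize in the background (slow).
        q.put((epoch, copy.deepcopy(state)))
    q.put(None)  # tell the writer to finish
    t.join()
    return state


ckpt_dir = tempfile.mkdtemp()
final = train_with_background_checkpoints(3, ckpt_dir)

# Any epoch's state can be restored later from its checkpoint file.
with open(os.path.join(ckpt_dir, "epoch_1.pkl"), "rb") as f:
    restored = pickle.load(f)
```

The deep copy matters: without it, the training loop could mutate the state while the writer is still serializing it, and the checkpoint would no longer correspond to a well-defined point in training.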

Findings

1. flor achieves around 7% overhead in checkpointing

2. Hindsight log replay is orders of magnitude faster than retraining from scratch

3. The toolkit is adaptable and easy to integrate into existing workflows

Abstract

In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible. Optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint -- a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for…
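The record-then-replay workflow described in the abstract can be sketched with a toy training loop. This is an illustrative sketch only, not flor's API: `train_epoch`, `record`, `replay`, the in-memory checkpoint list, and the toy model are hypothetical names and simplifications. The record phase runs without the log statement and keeps one checkpoint per epoch; the replay phase restores a single checkpoint and re-executes only that epoch with a log statement added after the fact.

```python
import copy


def train_epoch(state, data, log=None):
    """One epoch of toy SGD fitting y = w * x; optionally logs per-step values."""
    for step, (x, y) in enumerate(data):
        grad = 2 * (state["w"] * x - y) * x
        state["w"] -= state["lr"] * grad
        if log is not None:
            log(step, grad, state["w"])  # log statement added in hindsight
    return state


def record(data, epochs, lr=0.05):
    """Record phase: train without logging, checkpointing state at the start of each epoch."""
    state = {"w": 0.0, "lr": lr}
    checkpoints = []
    for _ in range(epochs):
        checkpoints.append(copy.deepcopy(state))
        train_epoch(state, data)
    return state, checkpoints


def replay(checkpoints, data, epoch, log):
    """Replay phase: restore one epoch's checkpoint and re-run only that epoch with logging."""
    state = copy.deepcopy(checkpoints[epoch])
    return train_epoch(state, data, log=log)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
final_state, checkpoints = record(data, epochs=5)

# Later, the developer decides they want the gradients from epoch 3.
grads = []
replayed = replay(checkpoints, data, epoch=3, log=lambda step, g, w: grads.append((step, g)))
```

Because the toy loop is deterministic, replaying epoch 3 from its checkpoint reproduces exactly the state that the original run had at the start of epoch 4, without re-executing epochs 0 through 2. Real training code would also need to checkpoint optimizer state and random number generator state for replay to match the original run.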

Code & Models

Repositories

ucbrise/flor (PyTorch, official)
