POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging

Shishir G. Patil; Paras Jain; Prabal Dutta; Ion Stoica; Joseph E. Gonzalez

arXiv:2207.07697·cs.LG·July 19, 2022·6 cites

Open Access 1 Repo

TL;DR

POET is an algorithm that enables training large neural networks on memory-constrained edge devices by jointly optimizing rematerialization and paging, which reduces energy consumption and lets larger models fit within the device's memory.

Contribution

It introduces a MILP-based approach to optimize memory and energy use for training neural networks on embedded devices without altering backpropagation correctness.

Findings

1. Enables fine-tuning ResNet-18 and BERT on Cortex-M devices.
2. Reduces energy consumption compared to existing edge training methods.
3. Supports training larger models within strict memory constraints.

Abstract

Fine-tuning models on edge devices like mobile phones would enable privacy-preserving personalization over sensitive data. However, edge training has historically been limited to relatively small models with simple architectures because training is both memory and energy intensive. We present POET, an algorithm to enable training large neural networks on memory-scarce battery-operated edge devices. POET jointly optimizes the integrated search spaces of rematerialization and paging, two algorithms to reduce the memory consumption of backpropagation. Given a memory budget and a run-time constraint, we formulate a mixed-integer linear program (MILP) for energy-optimal training. Our approach enables training significantly larger models on embedded devices while reducing energy consumption, without modifying the mathematical correctness of backpropagation. We demonstrate that it is…
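
To make the optimization problem in the abstract concrete, here is a minimal sketch of the decision it describes: for each stored activation, choose whether to keep it in memory, rematerialize it (recompute it later), or page it out to secondary storage, so that extra energy is minimized under a memory budget and a run-time limit. This is not POET's actual formulation. It is a toy that assumes a single summed memory constraint rather than a per-timestep schedule, uses exhaustive search in place of a MILP solver, and relies on hypothetical layer names and cost numbers (`LAYERS`, `plan`) invented for illustration.

```python
from itertools import product

# Hypothetical per-activation costs, invented for illustration:
# (name, size_kb, remat_ms, remat_mj, page_ms, page_mj)
LAYERS = [
    ("conv1", 64, 4.0, 2.0, 9.0, 1.0),
    ("conv2", 128, 6.0, 3.0, 12.0, 1.5),
    ("fc", 32, 1.0, 0.5, 5.0, 0.8),
]

KEEP, REMAT, PAGE = "keep", "remat", "page"


def plan(layers, memory_budget_kb, extra_time_budget_ms):
    """Return (energy_mj, choices) for the cheapest feasible plan, or None.

    Simplifying assumption: memory use is the sum of all kept activations,
    and time/energy overheads simply add up across layers.
    """
    best = None
    # Enumerate every keep/remat/page assignment (3^n candidates).
    for choices in product((KEEP, REMAT, PAGE), repeat=len(layers)):
        mem = time = energy = 0.0
        for (_, size, r_ms, r_mj, p_ms, p_mj), choice in zip(layers, choices):
            if choice == KEEP:
                mem += size          # stays resident, no overhead
            elif choice == REMAT:
                time += r_ms         # recomputed during the backward pass
                energy += r_mj
            else:
                time += p_ms         # written out and read back
                energy += p_mj
        feasible = mem <= memory_budget_kb and time <= extra_time_budget_ms
        if feasible and (best is None or energy < best[0]):
            best = (energy, choices)
    return best


if __name__ == "__main__":
    print(plan(LAYERS, memory_budget_kb=100, extra_time_budget_ms=20))
    print(plan(LAYERS, memory_budget_kb=100, extra_time_budget_ms=10))
```

With these made-up numbers, a loose time limit favors paging the largest activation (cheaper in energy, slower), while a tighter limit forces rematerialization instead. Exhaustive search only works for a handful of layers, which is one reason a MILP formulation is the more practical choice at realistic model sizes.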

Code & Models

Repositories

shishirpatil/poet
none · Official

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Age of Information Optimization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Attention Dropout · Dense Connections · Layer Normalization · Weight Decay · Linear Warmup With Linear Decay · WordPiece