POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
Shishir G. Patil, Paras Jain, Prabal Dutta, Ion Stoica, Joseph E. Gonzalez

TL;DR
POET is a novel algorithm that enables training large neural networks on memory-constrained edge devices by jointly optimizing rematerialization and paging, significantly improving energy efficiency and model size capabilities.
Contribution
It introduces a MILP-based approach to optimize memory and energy use for training neural networks on embedded devices without altering backpropagation correctness.
Findings
Enables fine-tuning ResNet-18 and BERT on Cortex-M devices.
Reduces energy consumption compared to existing edge training methods.
Supports training larger models within strict memory constraints.
Abstract
Fine-tuning models on edge devices like mobile phones would enable privacy-preserving personalization over sensitive data. However, edge training has historically been limited to relatively small models with simple architectures because training is both memory and energy intensive. We present POET, an algorithm to enable training large neural networks on memory-scarce battery-operated edge devices. POET jointly optimizes the integrated search spaces of rematerialization and paging, two algorithms to reduce the memory consumption of backpropagation. Given a memory budget and a run-time constraint, we formulate a mixed-integer linear program (MILP) for energy-optimal training. Our approach enables training significantly larger models on embedded devices while reducing energy consumption, without modifying the mathematical correctness of backpropagation. We demonstrate that it is…
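To make the optimization problem in the abstract concrete, here is a minimal toy sketch, not the paper's MILP. It assumes each activation is either kept in memory, rematerialized (recomputed), or paged to secondary storage. It then exhaustively enumerates those choices to find the lowest extra energy that fits a memory budget and a run-time budget. The function name `plan_training`, the per-layer cost fields, and all numbers are hypothetical. The single aggregate memory constraint is a simplification, and brute-force enumeration stands in for a real MILP solver.

```python
from itertools import product

KEEP, REMAT, PAGE = "keep", "remat", "page"

def plan_training(layers, memory_budget, time_budget):
    """Return (extra_energy, plan) with minimal extra energy, or None if infeasible.

    Toy cost model (an assumption, not POET's formulation): keeping an
    activation costs memory only; rematerializing or paging it costs
    extra time and energy but no resident memory.
    """
    best = None
    # Enumerate every keep/remat/page assignment (3^n, fine for tiny n only).
    for plan in product((KEEP, REMAT, PAGE), repeat=len(layers)):
        pairs = list(zip(layers, plan))
        memory = sum(l["mem"] for l, c in pairs if c == KEEP)
        time = sum(l[c + "_time"] for l, c in pairs if c != KEEP)
        energy = sum(l[c + "_energy"] for l, c in pairs if c != KEEP)
        feasible = memory <= memory_budget and time <= time_budget
        if feasible and (best is None or energy < best[0]):
            best = (energy, plan)
    return best

# Hypothetical per-activation costs in arbitrary units.
layers = [
    {"mem": 4, "remat_time": 1, "remat_energy": 2, "page_time": 3, "page_energy": 1},
    {"mem": 2, "remat_time": 2, "remat_energy": 5, "page_time": 2, "page_energy": 3},
    {"mem": 3, "remat_time": 1, "remat_energy": 1, "page_time": 4, "page_energy": 2},
]

result = plan_training(layers, memory_budget=4, time_budget=4)
print(result)
```

Enumeration grows as 3^n, which is why a formulation like the one the abstract describes hands the same kind of constrained choice to a MILP solver instead.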
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Age of Information Optimization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Attention Dropout · Dense Connections · Layer Normalization · Weight Decay · Linear Warmup With Linear Decay · WordPiece
