R2F: A Remote Retraining Framework for AIoT Processors with Computing Errors

Dawen Xu; Meng He; Cheng Liu; Ying Wang; Long Cheng; Huawei Li; Xiaowei Li; Kwang-Ting Cheng

arXiv:2107.03096 · cs.AR · July 8, 2021

TL;DR

This paper introduces R2F, a remote retraining framework for AIoT processors with soft errors, improving model resilience and accuracy while optimizing data transmission and retraining efficiency.

Contribution

The paper proposes a novel remote retraining framework (R2F) for AIoT processors with errors, including an optimized partial TMR strategy and a sparse increment compression method.
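The two techniques named above can be illustrated with a minimal sketch. This is not the paper's implementation: the summary here does not describe how the partial TMR strategy selects what to protect or how the increments are encoded. The sketch assumes TMR means plain triple modular redundancy with majority voting, and that a "sparse increment" is the set of weight changes whose magnitude exceeds a threshold. The names `tmr_vote`, `compress_increment`, `apply_increment`, and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def tmr_vote(results):
    """Majority vote over three redundant results (triple modular redundancy)."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    # A value seen at least twice wins; if all three replicas disagree,
    # the error cannot be masked.
    if count < 2:
        raise RuntimeError("no majority among replicas")
    return value


def compress_increment(old_weights, new_weights, threshold=1e-3):
    """Keep only the weight changes larger than `threshold` (assumed criterion)."""
    if len(old_weights) != len(new_weights):
        raise ValueError("weight vectors must have the same length")
    return {
        i: new - old
        for i, (old, new) in enumerate(zip(old_weights, new_weights))
        if abs(new - old) > threshold
    }


def apply_increment(old_weights, increment):
    """Rebuild the updated weights on the device from the sparse increment."""
    updated = list(old_weights)
    for i, delta in increment.items():
        updated[i] += delta
    return updated


old = [0.5, -0.25, 1.0]
new = [0.5, 0.0, 1.0]
increment = compress_increment(old, new)  # only index 1 changed
restored = apply_increment(old, increment)
voted = tmr_vote([7, 7, 9])  # one faulty replica is outvoted
```

In this sketch only the changed entries travel between server and device, which is the general idea behind sending increments rather than full models. Applying the vote to selected computations only, rather than all of them, is what would make the redundancy "partial".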

Findings

01. Top-5 accuracy improved by up to 13.73%

02. Retraining time reduced by 38%-88% with minimal accuracy loss

03. R2F balances accuracy and performance penalties effectively

Abstract

AIoT processors fabricated with newer technology nodes suffer from rising soft error rates due to shrinking transistor sizes and lower supply voltages. Soft errors on AIoT processors, particularly on deep learning accelerators (DLAs) with massive computing, may cause substantial computing errors. These computing errors are difficult to capture with conventional training on general-purpose processors such as CPUs and GPUs in a server. Directly applying offline-trained neural network models to edge accelerators with errors may lead to considerable loss of prediction accuracy. To address this problem, we propose a remote retraining framework (R2F) for remote AIoT processors with computing errors. It places the remote AIoT processor with soft errors in the training loop so that the on-site computing errors can be learned with the application data on the server and the retrained models…
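To make the link between a soft error and a computing error concrete, the following sketch flips one bit in a quantized weight and recomputes a dot product. The fault model is an assumption: the abstract does not state where bit flips occur or what data width the accelerator uses, so a single flip in an 8-bit two's-complement weight is used purely for illustration. `flip_bit` and `dot` are hypothetical helper names.

```python
def flip_bit(value, bit, width=8):
    """Flip one bit of a signed two's-complement integer of `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    raw = (value & mask) ^ (1 << bit)
    # Reinterpret the raw bit pattern as a signed value.
    if raw >= 1 << (width - 1):
        raw -= 1 << width
    return raw


def dot(weights, inputs):
    """Integer dot product, standing in for one accelerator output."""
    return sum(w * x for w, x in zip(weights, inputs))


weights = [3, -2, 5]
inputs = [1, 4, 2]

clean = dot(weights, inputs)  # 3 - 8 + 10 = 5

# Assumed fault: bit 6 of the first weight flips, turning 3 into 67.
faulty_weights = [flip_bit(weights[0], 6)] + weights[1:]
faulty = dot(faulty_weights, inputs)  # 67 - 8 + 10 = 69
```

A flip in a high-order bit changes the result far more than one in a low-order bit. A model trained only on error-free hardware never sees deviations like this one.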

Taxonomy

Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Machine Learning and ELM