Evaluating and Enhancing Robustness of Deep Recommendation Systems Against Hardware Errors
Dongning Ma, Xun Jiao, Fred Lin, Mengshi Zhang, Alban Desmaison, Thomas Sellinger, Daniel Moore, Sriram Sankar

TL;DR
This paper systematically studies the robustness of deep recommendation systems against hardware errors, introduces an error injection framework, and evaluates mitigation techniques like activation clipping to improve system resilience.
Contribution
The paper is the first to systematically analyze DRS robustness against hardware errors. It develops Terrorch, an error injection framework, and evaluates mitigation strategies such as activation clipping.
Findings
Activation clipping recovers up to 30% of degraded AUC-ROC scores.
DRS robustness varies with model parameters and input characteristics.
Error mitigation methods can significantly improve system resilience.
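To make the activation clipping finding concrete, here is a minimal sketch of the general idea: values that fall outside an expected range are forced back into it, so that a single corrupted activation cannot dominate the output. This is not the paper's implementation. The function name, the bounds, the choice to saturate rather than zero out-of-range values, and the NaN handling are all assumptions made for illustration.

```python
import math

def clip_activations(values, lower, upper, nan_value=0.0):
    """Clamp each activation into [lower, upper].

    Hypothetical helper: the bounds would normally be profiled from
    error-free runs; here they are passed in by the caller.
    NaN values are replaced with nan_value (an assumption).
    """
    clipped = []
    for v in values:
        if math.isnan(v):
            clipped.append(nan_value)
        else:
            # Saturate out-of-range values at the nearest bound.
            clipped.append(min(max(v, lower), upper))
    return clipped

# One activation has been corrupted into a huge value.
activations = [0.3, -1.2, 3.4e38, 2.5]
print(clip_activations(activations, -6.0, 6.0))
```

In this sketch the corrupted value is pulled back to the upper bound while the in-range values pass through unchanged.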
Abstract
Deep recommendation systems (DRS) heavily depend on specialized HPC hardware and accelerators to optimize energy, efficiency, and recommendation quality. Despite the growing number of hardware errors observed in large-scale fleet systems where DRS are deployed, the robustness of DRS has been largely overlooked. This paper presents the first systematic study of DRS robustness against hardware errors. We develop Terrorch, a user-friendly, efficient, and flexible error injection framework built on top of the widely used PyTorch. We evaluate a wide range of models and datasets and observe that DRS robustness against hardware errors is influenced by various factors, from model parameters to input characteristics. We also explore three error mitigation methods: algorithm-based fault tolerance (ABFT), activation clipping, and selective bit protection (SBP). We find that applying activation…
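As an illustration of what a hardware error can look like at the value level, the sketch below flips a single bit in the IEEE 754 float32 encoding of a number. It uses only the Python standard library and does not reproduce Terrorch or its API. The function name `flip_bit` and the single-bit-flip fault model are assumptions chosen for illustration.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 31 = sign) of a float32 value.

    Hypothetical helper for illustration; not part of any framework.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    flipped = as_int ^ (1 << bit)
    # Reinterpret the modified integer as a float32 again.
    (result,) = struct.unpack("<f", struct.pack("<I", flipped))
    return result

print(flip_bit(1.0, 31))  # sign bit flipped
print(flip_bit(0.5, 30))  # top exponent bit flipped: a very large value
print(flip_bit(1.0, 0))   # lowest mantissa bit flipped: a tiny change
```

The position of the flipped bit matters a great deal: a flip in the low mantissa bits barely changes the value, while a flip in the high exponent bit can turn 0.5 into a number on the order of 1e38. That asymmetry is one reason selectively protecting certain bits is a plausible mitigation.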
Taxonomy
Topics: Advanced Neural Network Applications · Age of Information Optimization · Cloud Computing and Resource Management
