Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance
Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu

TL;DR
This paper introduces a methodology to measure application error tolerance, enabling tailored memory reliability in datacenters, which reduces costs while maintaining high server availability.
Contribution
It presents a new way to quantify application memory error tolerance, analyzes three workloads for error vulnerability, and proposes heterogeneous-reliability memory designs to optimize datacenter costs.
Findings
Memory error tolerance varies across applications.
Server hardware cost can be reduced by 4.7% while maintaining 99.90% single-server availability.
Heterogeneous-reliability memory designs are effective.
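To put the 99.90% availability figure in concrete terms, the short sketch below converts an availability target into a downtime budget. It is only illustrative arithmetic, not the paper's methodology: the function name `allowed_downtime_hours` and the choice of a 365-day year are assumptions made here.

```python
# Illustrative arithmetic only; names and the 365-day year are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    # 99.90% availability leaves 0.1% of the year as permissible downtime.
    print(f"{allowed_downtime_hours(0.9990):.2f} hours per year")
```

Under these assumptions, a 99.90% target leaves roughly 8.76 hours of downtime per server per year.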
Abstract
This paper summarizes our work on characterizing application memory error vulnerability to optimize datacenter cost via Heterogeneous-Reliability Memory (HRM), which was published in DSN 2014, and examines the work's significance and future potential. Memory devices represent a key component of datacenter total cost of ownership (TCO), and techniques used to reduce errors that occur on these devices increase this cost. Existing approaches to providing reliability for memory devices pessimistically treat all data as equally vulnerable to memory errors. Our key insight is that there exists a diverse spectrum of tolerance to memory errors in new data-intensive applications, and that traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. This presents an opportunity to greatly reduce server hardware cost by provisioning the right amount of memory…
Taxonomy
Topics: Radiation Effects in Electronics · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
