Garbage Collection or Serialization? Between a Rock and a Hard Place!
Iacovos G. Kolokasis, Giannos Evdorou, Anastasios Papagiannis, Foivos Zakkak, Christos Kozanitis, Shoaib Akram, Polyvios Pratikakis, and Angelos Bilas

TL;DR
TeraHeap is a system for big data frameworks that eliminates serialization/deserialization (S/D) overhead and reduces garbage collection (GC) cost for a large portion of objects by placing them in a second, high-capacity heap on fast storage.
Contribution
The paper introduces TeraHeap, which extends the JVM to use a second heap on fast storage, reducing serialization/deserialization (S/D) and garbage collection (GC) costs, and provides a hint-based interface that frameworks use to optimize object placement.
Findings
Up to 73% performance improvement in Spark.
Up to 8x less DRAM usage compared to native Spark.
Outperforms Panthera garbage collector by up to 69%.
Abstract
Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive amounts of data that do not always fit on the heap. Therefore, frameworks temporarily move long-lived objects outside the managed heap (off-heap) on a fast storage device. Unfortunately, this practice results in: (1) high serialization/deserialization (S/D) cost, and (2) high memory pressure when off-heap objects are moved back to the managed heap for processing. In this paper, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of the objects in big data frameworks. TeraHeap relies on three concepts. (1) It eliminates S/D cost by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It reduces GC cost by fencing the garbage collector from scanning H2 objects. (3) It offers a simple hint-based…
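To make the S/D cost in point (1) concrete, here is a minimal sketch of the baseline practice the abstract describes: an object graph is serialized to bytes before leaving the managed heap and deserialized into a fresh copy when it is needed again. It does not show TeraHeap itself. The class name SdRoundTrip and its helper methods are hypothetical, and the sketch uses Java's built-in serialization; frameworks may use a different serializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the serialize/deserialize round trip.
public class SdRoundTrip {

    // Walk the object graph and flatten it into bytes, as a framework
    // would before moving cached data off the managed heap.
    static byte[] serialize(Object obj) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild the object graph on the managed heap. Every object is
    // allocated again, which adds memory pressure and GC work.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<Integer> cached = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            cached.add(i);
        }

        byte[] offHeapCopy = serialize(cached);     // cost paid when moving off-heap
        Object restored = deserialize(offHeapCopy); // cost paid again when moving back

        System.out.println("equal contents: " + restored.equals(cached));
        System.out.println("distinct heap copy: " + (restored != cached));
        System.out.println("serialized bytes: " + offHeapCopy.length);
    }
}
```

Both directions traverse the whole object graph, and the restored list is a new on-heap copy rather than the original objects. These are the two costs the abstract attributes to off-heap caching.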
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
