# SpeedMalloc: Improving Multi-threaded Applications via a Lightweight Core for Memory Allocation

**Authors:** Ruihao Li, Qinzhe Wu, Krishna Kavi, Gayatri Mehta, Jonathan C. Beard, Neeraja J. Yadwadkar, Lizy K. John

arXiv: 2508.20253 · 2025-08-29

## TL;DR

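SpeedMalloc offloads memory allocation in multi-threaded applications to a lightweight, programmable support-core that keeps all allocator metadata in its own caches. This reduces cache conflicts with user data and removes the need for cross-core metadata synchronization, yielding speedups of 1.15x to 1.75x over five state-of-the-art software and hardware allocators.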

## Contribution

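The paper proposes a support-core architecture for memory allocation: a lightweight general-purpose core that houses all allocator metadata in its own caches. This minimizes cache conflicts with user data in multi-threaded programs, and because the core is programmable rather than a fixed-function accelerator, it can adopt new allocator designs.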
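As a rough software analogy for this architecture, the sketch below dedicates one thread to allocation and lets it own all allocator metadata, so application threads never touch or lock that metadata themselves. It is not the paper's hardware design: `AllocatorCore`, its request queue, and the fixed-size block heap are hypothetical names and simplifications introduced here, and a Python queue round trip stands in for whatever cross-core synchronization the real support-core uses.

```python
import queue
import threading


class AllocatorCore:
    """Hypothetical software stand-in for a support-core.

    A single service thread owns the free list, so the metadata needs no
    lock and is never interleaved with the callers' own data structures.
    """

    def __init__(self, heap_size, block_size):
        self._free = list(range(0, heap_size, block_size))  # block offsets
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Only this loop ever reads or writes self._free.
        while True:
            op, arg, reply = self._requests.get()
            if op == "malloc":
                reply.put(self._free.pop() if self._free else None)
            elif op == "free":
                self._free.append(arg)
                reply.put(None)
            else:  # "stop"
                reply.put(None)
                return

    def _call(self, op, arg=None):
        reply = queue.Queue(maxsize=1)  # per-request reply slot
        self._requests.put((op, arg, reply))
        return reply.get()

    def malloc(self):
        """Return a free block offset, or None if the heap is exhausted."""
        return self._call("malloc")

    def free(self, offset):
        self._call("free", offset)

    def stop(self):
        self._call("stop")
        self._thread.join()


def demo(num_threads=4, per_thread=4, block_size=64):
    """Allocate the whole heap from several threads; return sorted offsets."""
    core = AllocatorCore(num_threads * per_thread * block_size, block_size)
    held = []
    held_lock = threading.Lock()  # guards only the demo's result list

    def worker():
        mine = [core.malloc() for _ in range(per_thread)]
        with held_lock:
            held.extend(mine)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    exhausted = core.malloc() is None  # every block is currently held
    for offset in held:
        core.free(offset)
    core.stop()
    return sorted(held), exhausted
```

Calling `demo()` has four threads allocate four blocks each; because every request is serialized through the one owner thread, no block is handed out twice even though the free list itself is unlocked. The trade-off this analogy makes visible is that each allocation now pays a round trip to the owner, which is why the efficiency of cross-core synchronization matters to the design.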

## Key findings

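- SpeedMalloc achieves speedups of 1.75x over Jemalloc, 1.18x over TCMalloc, 1.15x over Mimalloc, 1.23x over Mallacc, and 1.18x over Memento on multi-threaded workloads.
- Keeping all allocator metadata in the support-core's own caches minimizes cache conflicts with user data and eliminates cross-core metadata synchronization.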
- Although memory allocation accounts for only about 5% of executed instructions, the choice of allocator can change overall program performance by up to 2.7x.

## Abstract

Memory allocation, though constituting only a small portion of the executed code, can have a "butterfly effect" on overall program performance, leading to significant and far-reaching impacts. Despite accounting for just approximately 5% of total instructions, memory allocation can result in up to a 2.7x performance variation depending on the allocator used. This effect arises from the complexity of memory allocation in modern multi-threaded multi-core systems, where allocator metadata becomes intertwined with user data, leading to cache pollution or increased cross-thread synchronization overhead. Offloading memory allocators to accelerators, e.g., Mallacc and Memento, is a potential direction to improve the allocator performance and mitigate cache pollution. However, these accelerators currently have limited support for multi-threaded applications, and synchronization between cores and accelerators remains a significant challenge.

We present SpeedMalloc, using a lightweight support-core to process memory allocation tasks in multi-threaded applications. The support-core is a lightweight programmable processor with efficient cross-core data synchronization and houses all allocator metadata in its own caches. This design minimizes cache conflicts with user data and eliminates the need for cross-core metadata synchronization. In addition, using a general-purpose core instead of domain-specific accelerators makes SpeedMalloc capable of adopting new allocator designs. We compare SpeedMalloc with state-of-the-art software and hardware allocators, including Jemalloc, TCMalloc, Mimalloc, Mallacc, and Memento. SpeedMalloc achieves 1.75x, 1.18x, 1.15x, 1.23x, and 1.18x speedups on multithreaded workloads over these five allocators, respectively.
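A quick back-of-the-envelope check shows why the 2.7x figure cannot come from the allocator's own instructions. The sketch below applies Amdahl's law under a simplifying assumption that is not from the paper: the 5% instruction share is treated as a 5% share of execution time. The function name `max_direct_speedup` is introduced here for illustration.

```python
def max_direct_speedup(alloc_fraction):
    """Amdahl's-law upper bound on speedup if allocator work cost nothing.

    Assumes alloc_fraction of execution time is spent in the allocator
    (here approximated by its share of instructions).
    """
    if not 0.0 <= alloc_fraction < 1.0:
        raise ValueError("alloc_fraction must be in [0, 1)")
    return 1.0 / (1.0 - alloc_fraction)


bound = max_direct_speedup(0.05)
print(f"{bound:.3f}x")  # prints 1.053x
```

Under that assumption, even a zero-cost allocator could speed a program up by only about 5%, so a 2.7x spread has to come from indirect effects on the other 95% of the program. That is consistent with the abstract's attribution of the effect to cache pollution and cross-thread synchronization overhead.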

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20253/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20253/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/2508.20253/full.md

---
Source: https://tomesphere.com/paper/2508.20253