Cimple: Instruction and Memory Level Parallelism
Vladimir Kiriansky, Haoran Xu, Martin Rinard, Saman Amarasinghe

TL;DR
Cimple introduces a coroutine-based programming model for instruction and memory level parallelism. Tasks yield at long-latency operations and are interleaved on a single thread, which improves performance on memory-intensive workloads with large working sets.
Contribution
The paper presents the IMLP task programming model and the Cimple DSL, which together enhance ILP and MLP exploitation in workloads with large working sets, achieving state-of-the-art performance.
Findings
Up to 2.5x throughput improvement over hardware multithreading
Up to 6.4x single-thread speedup in core algorithms
Effective integration of task scheduling with vectorization and prefetching
Abstract
Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for in-flight memory requests. These resources, however, often exhibit poor utilization rates on workloads with large working sets, e.g., in-memory databases, key-value stores, and graph analytics, as compilers and hardware struggle to expose ILP and MLP from the instruction stream automatically. In this paper, we introduce the IMLP (Instruction and Memory Level Parallelism) task programming model. IMLP tasks execute as coroutines that yield execution at annotated long-latency operations, e.g., memory accesses, divisions, or unpredictable branches. IMLP tasks are interleaved on a single thread, and integrate well with thread parallelism and vectorization.…
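The execution model described in the abstract (tasks written as coroutines that yield at long-latency operations and are interleaved on one thread) can be illustrated with a minimal Python sketch. This is not the Cimple DSL and does not reproduce its performance effects: Python generators stand in for coroutines, and the names `binary_search_task`, `run_interleaved`, and `width` are hypothetical, chosen only for illustration. A native implementation would presumably issue a prefetch at each yield point so that several memory requests are in flight while other tasks run.

```python
from collections import deque


def binary_search_task(sorted_keys, target):
    """Lower-bound binary search written as a coroutine (illustrative only).

    The task yields just before each probe of sorted_keys, which stands in
    for an annotated long-latency memory access.
    """
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        # Assumption: a native version would prefetch sorted_keys[mid] here.
        yield mid  # hand control back to the scheduler
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo  # index of the first key >= target


def run_interleaved(tasks, width):
    """Round-robin scheduler keeping at most `width` tasks in flight on one thread."""
    results = [None] * len(tasks)
    pending = deque(enumerate(tasks))
    in_flight = deque()
    while pending or in_flight:
        # Refill the in-flight window from the pending queue.
        while pending and len(in_flight) < width:
            in_flight.append(pending.popleft())
        idx, task = in_flight.popleft()
        try:
            next(task)  # run the task until its next yield point
            in_flight.append((idx, task))
        except StopIteration as done:
            results[idx] = done.value  # task finished; record its result
    return results


keys = list(range(0, 100, 2))  # 0, 2, ..., 98
targets = [4, 37, 98, 0]
results = run_interleaved([binary_search_task(keys, t) for t in targets], width=2)
```

Here `width` bounds how many lookups are interleaved at once. In a real system that number would be tuned to the hardware's capacity for in-flight memory requests; in this sketch it only changes the order in which tasks advance, not the results.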
