# New Tools, Programming Models, and System Support for Processing-in-Memory Architectures

**Authors:** Geraldo F. Oliveira

arXiv: 2508.19868 · 2025-08-28

## TL;DR

This dissertation develops new tools, models, and system support to facilitate the adoption of processing-in-memory architectures, especially DRAM-based solutions, by addressing data movement, programmability, and latency challenges.

## Contribution

It introduces four major innovations: a data movement characterization methodology, a hardware/software co-designed substrate for PUD, a runtime engine to reduce PUD latency, and a programming framework for PIM.

## Key findings

- DAMOV characterizes memory bottlenecks and benchmarks data movement.
- MIMDRAM improves programmability and flexibility of PUD architectures.
- Proteus significantly reduces PUD operation latency through concurrency, data precision reduction, and adaptive algorithms.
- DaPPA eases programmability for general-purpose PNM architectures by letting programmers write PIM-friendly code without explicitly managing hardware resources.

## Abstract

Our goal in this dissertation is to provide tools, programming models, and system support for PIM architectures (with a focus on DRAM-based solutions), to ease the adoption of PIM in current and future systems. To this end, we make at least four new major contributions. First, we introduce DAMOV, the first rigorous methodology to characterize memory-related data movement bottlenecks in modern workloads, and the first data movement benchmark suite. Second, we introduce MIMDRAM, a new hardware/software co-designed substrate that addresses the major current programmability and flexibility limitations of the bulk bitwise execution model of processing-using-DRAM (PUD) architectures. MIMDRAM enables the allocation and control of only the needed computing resources inside DRAM for PUD computing. Third, we introduce Proteus, the first hardware framework that addresses the high execution latency of bulk bitwise PUD operations in state-of-the-art PUD architectures by implementing a data-aware runtime engine for PUD. Proteus reduces the latency of PUD operations in three different ways: (i) Proteus concurrently executes independent in-DRAM primitives that belong to a single PUD operation across DRAM arrays. (ii) Proteus dynamically reduces the bit-precision (and consequently the latency and energy consumption) of PUD operations by exploiting narrow values (i.e., values with many leading zeros or ones). (iii) Proteus chooses and uses the most appropriate data representation and arithmetic algorithm implementation for a given PUD instruction transparently to the programmer. Fourth, we introduce DaPPA (data-parallel processing-in-memory architecture), a new programming framework that eases programmability for general-purpose PNM architectures by allowing the programmer to write efficient PIM-friendly code without the need to manage hardware resources explicitly.
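The narrow-value idea in point (ii) can be illustrated with a minimal software sketch. It is not Proteus's actual mechanism, which the abstract describes only as a data-aware hardware runtime engine. The function names `min_signed_width` and `required_width` are hypothetical, and the sketch assumes operands are two's-complement integers with a declared width of 32 bits. It shows how many bits a set of operands really needs once redundant leading zeros (for non-negative values) or leading ones (for negative values) are dropped.

```python
def min_signed_width(value: int) -> int:
    """Smallest two's-complement width (in bits) that can hold `value`.

    Non-negative values carry redundant leading zeros, negative values
    carry redundant leading ones; one sign bit must always be kept.
    """
    if value >= 0:
        return value.bit_length() + 1
    # ~value == -value - 1, which is non-negative for negative inputs.
    return (~value).bit_length() + 1


def required_width(values, declared_width: int = 32) -> int:
    """Width needed by the widest operand (hypothetical helper).

    `declared_width` is an assumed programmer-declared precision; an
    operand that does not fit in it is reported as an error.
    """
    values = list(values)
    if not values:
        raise ValueError("at least one operand is required")
    width = max(min_signed_width(v) for v in values)
    if width > declared_width:
        raise ValueError("operand does not fit in the declared width")
    return width


if __name__ == "__main__":
    operands = [3, -2, 100, -128]
    print(required_width(operands))
```

In this sketch, operands declared as 32-bit but all lying in the range -128 to 127 need only 8 bits. Under the abstract's claim that lower bit-precision reduces the latency and energy of a PUD operation, such operands are the case that dynamic precision reduction would benefit.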

## Figures

40 figures with captions in the complete paper: https://tomesphere.com/paper/2508.19868/full.md

---
Source: https://tomesphere.com/paper/2508.19868