Segmented Operations using Matrix Multiplications
Aleksandros Sobczyk, Giuseppe Sorrentino, Anastasios Zouzias

TL;DR
This paper introduces MMV-RAM, a new computational model combining matrix and vector units to optimize AI operations, demonstrating theoretical speed-ups and practical implementations on AI accelerators for diverse workloads.
Contribution
The paper proposes MMV-RAM, a theoretical model extending Vector-RAM with matrix units, and develops algorithms that leverage this model for improved AI kernel performance.
Findings
Theoretical analysis shows provable step-complexity speed-ups over vector-only approaches.
Algorithms effectively implement segmented scan and sum using MMUs.
Practical implementation on Ascend 910B accelerators demonstrates real-world benefits.
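To make the idea of expressing a scan as matrix multiplications concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a hypothetical tile size `s` for the matrix unit, uses a plain Python `matmul` as a stand-in for the MMU, and derives the segmented scan from an unsegmented scan by subtracting the prefix at each segment start, which is only one simple way to do it. All function names and the `flags` convention (1 marks the first element of a segment) are illustrative assumptions.

```python
def matmul(a, b):
    """Plain dense matrix product; a stand-in for the matrix unit (assumption)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def scan_matmul(x, s):
    """Inclusive prefix sums of x via products with an s x s triangular matrix."""
    n = len(x)
    if n == 0:
        return []
    # Pad to a multiple of s and reshape into rows of length s.
    padded = list(x) + [0] * ((-n) % s)
    rows = [padded[i:i + s] for i in range(0, len(padded), s)]
    # Upper-triangular ones matrix: (row @ upper)[j] = row[0] + ... + row[j].
    upper = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]
    local = matmul(rows, upper)  # independent local scan of every row
    if len(local) > 1:
        # Recursively scan the row totals, then add them as offsets (vector step).
        totals = [row[-1] for row in local[:-1]]
        offsets = scan_matmul(totals, s)
        for r in range(1, len(local)):
            local[r] = [v + offsets[r - 1] for v in local[r]]
    return [v for row in local for v in row][:n]


def segmented_scan(x, flags, s):
    """Inclusive scan that restarts wherever flags[i] == 1."""
    total = scan_matmul(x, s)
    out, base = [], 0
    # Sequential fix-up for clarity only; a parallel version would need
    # another scan-like step to propagate the segment-start prefix.
    for i, v in enumerate(total):
        if flags[i] and i > 0:
            base = total[i - 1]
        out.append(v - base)
    return out


def segmented_sum(x, flags, s):
    """Sum of each segment: the last scan value of every segment."""
    scanned = segmented_scan(x, flags, s)
    ends = [i for i in range(len(x)) if i == len(x) - 1 or flags[i + 1]]
    return [scanned[i] for i in ends]
```

Each recursion level shrinks the input by roughly a factor of `s`, so the number of levels grows like log_s(n), which matches the flavor of the step-complexity claim in the reviews below. For example, `segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1], 2)` restarts the running sum at positions 0, 3, and 5.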
Abstract
Specialized computational units that perform small matrix multiplications as primitive operations are typically present in modern AI accelerators. However, these Matrix Multiplication Units (MMUs) are often underutilized for many fundamental deep learning operations besides dense matrix multiplications. Coincidentally, the lack of a rigorous theoretical model of computation for such architectures obstructs algorithmic design. In this work, we propose MMV-RAM, a computational model which judiciously extends the Vector-RAM model with an additional MMU. We provide a detailed theoretical analysis and carefully balance the computational power between the matrix and vector units, guided by the circuit complexity lower bound that parity is not in AC^0. Given MMV-RAM, we proceed to algorithm design, starting with two fundamental parallel operations: segmented scan and sum. By expressing them…
Peer Reviews
Decision: Submitted to ICLR 2026
Please see the 'Summary'.
Theoretical Guarantees: The paper provides a formal theoretical analysis, proving that its algorithms achieve a step complexity of O(log_s(n)). This is provably faster than any vector-only algorithm, for which a step-complexity lower bound applies. Novelty (MMV-RAM model): The paper addresses a key gap left by the prior "TCU model" by formally including the Vector Unit (VCU). This makes it a more accurate theoretical representation of modern accelerators like TPUs, NVIDIA GPUs, and Ascend NPUs, which all have both mat
Doubt on Generalization, Requirement of Custom Hardware: The experimental speed-ups are demonstrated on a Huawei Ascend 910B using the proprietary AscendC programming framework. While the paper lists analogues (e.g., NVIDIA Tensor Cores), the results are not on commodity hardware, making them less generalizable. Theoretical vs. Practical Complexity: The most work-efficient algorithm presented (Theorem 4.3) is admitted to be "rather involved" and requires "specialized circuitry that might not be
Demonstrates 30–40% performance gains in efficiency and reduced latency compared to traditional methods. The segmented model is straightforward and applicable to real systems, with clear diagrams and structured explanations.
The paper lacks a formal analysis of segmentation boundaries and complexity trade-offs. Evaluation is limited to local or small-cluster setups; performance on large distributed systems remains untested. Missing information about hardware specs, configuration, and code availability, making reproducibility difficult. Results are presented clearly but lack significance testing (e.g., error bars, confidence intervals). Resource usage implications of segmentation are not deeply explored.
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Graph Theory and Algorithms
