TL;DR
This paper introduces a GPU framework for sparse tensor decomposition that handles tensors too large to fit in device memory and outperforms existing methods on real-world sparse tensors.
Contribution
The paper proposes the BLCO format and adaptive strategies for out-of-memory tensor operations, enabling efficient, conflict-resolving parallel computations on GPUs.
Findings
Achieves a 2.12× to 2.6× speedup over state-of-the-art methods.
Supports out-of-memory tensor processing on GPUs.
Reduces synchronization costs and improves in-memory performance.
Abstract
Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized Coordinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launching overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict resolution algorithm, in which threads collaborate instead of contending on memory access to…
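To make the idea of blocking and linearization concrete, here is a minimal Python sketch. It assumes the simplest possible encoding: each mode index is stored in a fixed-width bit field, the fields are concatenated into one integer key, and keys are grouped into blocks by their upper bits so that each block only needs to store the lower bits. The function names (`bits_needed`, `linearize`, `delinearize`, `split_into_blocks`) and the bit layout are illustrative assumptions, not the paper's actual BLCO encoding, which may order or interleave bits differently.

```python
from typing import Dict, List, Sequence, Tuple


def bits_needed(dim: int) -> int:
    """Bits required to represent any index in the range 0..dim-1."""
    return max(1, (dim - 1).bit_length())


def linearize(coord: Sequence[int], dims: Sequence[int]) -> int:
    """Pack a multi-index into one integer by concatenating per-mode bit fields.

    Assumption: plain concatenation, first mode in the highest bits.
    """
    key = 0
    for idx, dim in zip(coord, dims):
        if not 0 <= idx < dim:
            raise IndexError(f"index {idx} out of range for mode of size {dim}")
        key = (key << bits_needed(dim)) | idx
    return key


def delinearize(key: int, dims: Sequence[int]) -> Tuple[int, ...]:
    """Recover the multi-index using only shifts and masks."""
    coord = []
    for dim in reversed(dims):
        width = bits_needed(dim)
        coord.append(key & ((1 << width) - 1))
        key >>= width
    return tuple(reversed(coord))


def split_into_blocks(keys: List[int], low_bits: int) -> Dict[int, List[int]]:
    """Group keys by their upper bits; each block keeps only the low `low_bits` bits."""
    mask = (1 << low_bits) - 1
    blocks: Dict[int, List[int]] = {}
    for key in sorted(keys):
        blocks.setdefault(key >> low_bits, []).append(key & mask)
    return blocks


if __name__ == "__main__":
    dims = (5, 300, 70000)  # hypothetical 3-mode tensor shape
    coords = [(0, 1, 2), (4, 0, 0), (4, 299, 69999)]
    keys = [linearize(c, dims) for c in coords]
    # With 26 low bits, the block id here is just the first-mode index.
    for block_id, lows in split_into_blocks(keys, low_bits=26).items():
        print(block_id, [delinearize((block_id << 26) | low, dims) for low in lows])
```

Because decoding needs only shifts and masks, every nonzero is indexed the same way regardless of which mode is being processed, which is one plausible reading of how a single tensor copy can serve all modes. Blocking on the upper bits keeps the per-element key within a fixed width and gives a natural unit for streaming pieces of a large tensor to the device.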
