Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels
Shaoliang Yang, Jun Wang, and Yunsheng Wang

TL;DR
This paper introduces a single fused CUDA kernel for matrix-free 3D SIMP topology optimization that cuts end-to-end wall time by 4.6-7.3x on cantilever problems and reduces energy consumption relative to the conventional three-stage gather-GEMM-scatter implementation.
Contribution
A novel single-kernel implementation for gather, stiffness multiplication, and scatter in 3D SIMP optimization improves speed and energy efficiency over conventional approaches.
Findings
Achieves a problem-size-dependent 4.6-7.3x end-to-end SIMP wall-time speedup on cantilever problems of 216k-4.9M elements (single RTX 4090).
Reduces energy consumption by up to 4.9x.
Demonstrates effective BF16 variant with high problem conditioning.
Abstract
The matrix-free gather-batched-GEMM-scatter pattern eliminates global stiffness assembly for three-dimensional SIMP topology optimization, but the conventional three-stage implementation forces avoidable DRAM traffic between stages. We present a single fused CUDA kernel, implemented through CuPy's runtime compilation interface, that performs gather, per-element stiffness multiplication, and scatter accumulation in one pass. On a single RTX 4090 (24 GB), the fused path reaches a problem-size-dependent 4.6-7.3x end-to-end SIMP wall-time speedup across 216k-4.9M cantilever elements and 4.4x on the 499,125-element torsion benchmark. Against the same-precision FP32 three-stage baseline, the fused path still yields 2.3-4.6x on cantilever and 2.8x on torsion. Isolated CUDA-event cantilever-operator measurements reach 8.9-13.8x per matvec call, while separate instrumented board-power traces at…
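For readers unfamiliar with the gather-batched-GEMM-scatter pattern named in the abstract, the following NumPy sketch shows the conventional three-stage form of a matrix-free stiffness matvec. It is an illustrative reference only, not the authors' fused CUDA kernel: the function names, the structured hexahedral DOF numbering, the modified SIMP interpolation E(rho) = E_min + rho^p (E0 - E_min), and the default parameter values are all assumptions made for this example. Each stage writes an intermediate array, which is the inter-stage memory traffic the fused kernel is described as avoiding.

```python
import numpy as np

def hex_connectivity(nx, ny, nz):
    """Hypothetical helper: (n_elem, 24) DOF indices for a structured hex mesh."""
    nid = np.arange((nx + 1) * (ny + 1) * (nz + 1)).reshape(nx + 1, ny + 1, nz + 1)
    corners = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
               (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
    # Node index of each of the 8 corners for every element.
    nodes = np.stack(
        [nid[dx:dx + nx, dy:dy + ny, dz:dz + nz].ravel() for dx, dy, dz in corners],
        axis=1,
    )
    # Three displacement DOFs per node, interleaved as (x, y, z).
    return (3 * nodes[:, :, None] + np.arange(3)).reshape(-1, 24)

def simp_modulus(rho, e_min=1e-9, e0=1.0, p=3.0):
    """Assumed modified SIMP interpolation; parameter values are placeholders."""
    return e_min + rho**p * (e0 - e_min)

def matvec_three_stage(u, rho, edof, ke):
    """Compute y = K(rho) @ u without assembling the global matrix K."""
    u_e = u[edof]                                    # stage 1: gather, (n_elem, 24)
    f_e = simp_modulus(rho)[:, None] * (u_e @ ke.T)  # stage 2: batched GEMM with shared ke
    y = np.zeros_like(u)
    np.add.at(y, edof, f_e)                          # stage 3: scatter-add (handles shared DOFs)
    return y
```

Here `ke` stands for the 24x24 unit-modulus element stiffness matrix shared by all elements of a uniform mesh. On a GPU, the scatter stage would typically rely on atomic additions, since neighbouring elements write to the same nodal DOFs.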
