Locality-Aware Automatic Differentiation on the GPU for Mesh-Based Computations

Ahmed H. Mahmoud; Rahul Goel; Jonathan Ragan-Kelley; Justin Solomon

arXiv:2509.00406·cs.GR·February 3, 2026

TL;DR

This paper introduces a GPU-based automatic differentiation system optimized for mesh-based computations, leveraging locality and sparsity to improve efficiency in various applications.

Contribution

It presents a novel GPU system that performs automatic differentiation on meshes without global graphs, reducing memory traffic and supporting dynamic sparsity and matrix-free operations.

Findings

01. Outperforms state-of-the-art frameworks like PyTorch and JAX.

02. Achieves significant speedups in various mesh-based applications.

03. Supports diverse solver types and derivative modes.

Tables

Table 1. Mass-spring performance. We report only the time per time step spent evaluating gradients and Hessians, excluding other components such as the linear solver. Results are averaged over 1000 time steps on meshes of varying sizes, identified by their number of vertices.

# V	PyTorch (ms)	IndexedSum (ms)	Ours (ms)
$10^{2}$	269.8	17.07	0.18
$50^{2}$	7,205.6	18.72	0.22
$100^{2}$	29,516.76	11.74	0.25
$500^{2}$	OOM	15.73	3.09
$1000^{2}$	OOM	39.67	11.7

Table 2. Performance of the Newton solver using matrix-free Conjugate Gradient (CG) for mesh parameterization. Instead of constructing the full Hessian, we compute Hessian-vector products on-the-fly during each CG iteration. The reported time is the average per CG iteration.

# V ($\times 10^{6}$)	PyTorch (ms)	Ours (ms)	Speedup
$1.55 \times 10^{-3}$	0.039	0.026	1.4 $\times$
0.16	0.27	0.108	2.4 $\times$
0.56	0.896	0.383	2.3 $\times$
1.52	2.475	0.817	3.03 $\times$
1.83	2.987	1.102	2.71 $\times$

Table 3. Runtime breakdown for 60 Gauss-Newton (GN) iterations on a 253k-face mesh.

Stage	T (ms)
Construct $J$	304.4
Assemble $J$	98.7
Line Search	16.7
Linear Solver	8089.8

Table 4. Per-frame runtime breakdown of the scene in Figure 1.

Stage	Time (s)
Contact Detect	97.9 (49.1%)
Linear Solver	76.2 (38.2%)
Energy Eval	21.1 (10.5%)
Hessian Update	3.3 (1.7%)
Line Search	0.6 (0.27%)
Misc	0.06 (0.03%)

Equations

F(x) = \sum_{j \in \mathbb{E}} f_{j}(x_{j}),

g = \sum_{j \in \mathbb{E}} S_{j}^{\top} g_{j} \quad \text{and} \quad H = \sum_{j \in \mathbb{E}} S_{j}^{\top} H_{j} S_{j}.

J = \sum_{j \in \mathbb{E}} P_{j}^{\top} J_{j} S_{j},

E(x) = \frac{1}{2} \| x - (x^{n} + h v^{n}) \|_{M}^{2} + h^{2} P(x),

P_{e}(x) = l^{2} \frac{1}{2} k \left( \frac{\| x_{i} - x_{j} \|^{2}}{l^{2}} - 1 \right)^{2},

f(x) = \sum_{t \in T} \mathrm{area}_{t} \cdot \left( \| J_{t}(x) \|_{F}^{2} + \| J_{t}(x)^{-1} \|_{F}^{2} \right),

E = w_{\mathrm{fit}} E_{\mathrm{fit}} + w_{\mathrm{reg}} E_{\mathrm{reg}} + w_{\mathrm{rot}} E_{\mathrm{rot}}.

E_{\mathrm{fit}} = \sum_{i \in V} \begin{cases} \| o_{i} \|^{2}, & i \text{ fixed}, \\ \| o_{i} - c_{i} \|^{2}, & i \text{ displaced}, \\ 0, & \text{otherwise}, \end{cases}

E_{\mathrm{reg}} = \sum_{(i,j) \in E} \| (x_{j} - x_{i}) - R_{i} (u_{j} - u_{i}) \|^{2}.

E_{\mathrm{rot}} = \sum_{i \in V} \bigl( \cdots + (\| c_{0,i} \|^{2} - 1) + (\| c_{1,i} \|^{2} - 1) + (\| c_{2,i} \|^{2} - 1) \bigr)^{2}.

\min_{x \in \mathbb{R}^{2|V|}} \sum_{(i,j,k) \in F} \bigl[ \cdots + \| p_{i} - p_{j} \|^{2} + \| p_{j} - p_{k} \|^{2} + \| p_{k} - p_{i} \|^{2} \bigr]

R(s_{i}, x_{i}) = \frac{s_{i} + x_{i,1} \cdot b_{1,i} + x_{i,2} \cdot b_{2,i}}{\| s_{i} + x_{i,1} \cdot b_{1,i} + x_{i,2} \cdot b_{2,i} \|}

E(x) = \sum_{(i,j,k) \in F} A_{ijk}

x_{i} \leftarrow x_{i} - \lambda \frac{\partial E}{\partial x_{i}},

E_{\mathrm{bend}}(\theta) = k_{b} \frac{A}{3} (\theta - \theta')^{2} h^{2}

E_{\mathrm{inertia}}(x) = \frac{1}{2} m (x - x')^{2}

E_{\mathrm{box}}(d) = \frac{\kappa}{2} \hat{d} A_{c} (s - 1) \ln(s) h^{2}

Taxonomy

Topics: Parallel Computing and Optimization Techniques · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques

Full text

Locality-Aware Automatic Differentiation on the GPU for Mesh-Based Computations

Ahmed H. Mahmoud (ORCID 0000-0003-1857-913X), Rahul Goel (ORCID 0000-0002-9564-4022), Jonathan Ragan-Kelley (ORCID 0000-0001-6243-9543), and Justin Solomon (ORCID 0000-0002-7701-7586)

Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, USA

Abstract.

We present a GPU-based system for automatic differentiation (AD) of functions defined on triangle meshes, designed to exploit the locality and sparsity in mesh-based computation. Our system evaluates derivatives using per-element forward-mode AD, confining all computation to registers and shared memory and assembling global gradients, sparse Jacobians, and sparse Hessians directly on the GPU. By avoiding global computation graphs, intermediate buffers, and device-host synchronization, our approach minimizes memory traffic and enables efficient differentiation under both static and dynamically changing sparsity. Our programming model lets users express energy terms over mesh neighborhoods, while our system automatically manages parallel execution, derivative propagation, sparse assembly, and matrix-free operations such as Hessian-vector products. Our system supports both scalar- and vector-valued objectives, dynamic interaction-driven sparsity updates, and seamless integration with external GPU sparse linear solvers. We evaluate our system on applications including elastic and cloth simulation, surface parameterization, mesh smoothing, frame field design, ARAP deformation, and spherical manifold optimization. Across these tasks, our system consistently outperforms state-of-the-art differentiation frameworks, including PyTorch, JAX, Warp, Dr.JIT, EnzymeAD, and Thallo. We demonstrate speedups across a range of solver types, from Newton and Gauss-Newton for nonlinear least squares to L-BFGS and gradient descent, and across different derivative usage modes, including Hessian-vector products as well as full sparse Hessian and Jacobian construction. Our system is available as open source at https://github.com/owensgroup/RXMesh.

Copyright: CC. Journal: TOG, vol. 45, no. 4, article 50, July 2026. DOI: 10.1145/3811338. Submission ID: 744.

CCS concepts: Mathematics of computing → Automatic differentiation; Computing methodologies → Massively parallel algorithms; Computing methodologies → Mesh geometry models.

1. Introduction

Countless algorithms for science and engineering applications rely on a common building block: evaluating the derivatives of a functional defined over a mesh. Derivative computation most often arises in optimization-driven tasks in physical simulation, inverse problems, design automation, and geometry processing, where gradients and higher-order derivatives directly determine convergence, stability, and overall runtime. In many of these settings, derivative evaluation dominates the computational cost, making its efficiency a first-order concern rather than just an implementation detail.

The widespread adoption of (stochastic) gradient-based methods in machine learning has spurred the development of highly optimized automatic differentiation (AD) frameworks. Much of this progress has been guided by the needs of machine learning workloads where the dominant computational model consists of dense tensors and large, regular computation graphs. Systems such as PyTorch (Ansel et al., 2024) and JAX (Bradbury et al., 2018) excel in this regime, enabling end-to-end differentiation of neural networks.

In contrast, scientific and graphics workloads, particularly those involving meshes, operate in a different regime. Mesh-based problems often involve large, sparse systems with localized dependencies induced by irregular connectivity (see Figure 4). As resolution increases to meet accuracy requirements in applications such as biomechanics, structural engineering, and fluid simulation, the resulting Jacobians and Hessians grow in size while remaining sparse.

General-purpose AD frameworks fail to exploit this structure since they represent derivatives through dense or implicitly dense intermediate forms, resulting in poor runtime performance and excessive memory usage when applied to mesh-based problems. In practice, derivative evaluation can become a dominant cost, in some cases exceeding the cost of the downstream linear solve itself (see Figure 2). This inefficiency forces practitioners to limit problem size, avoid second-order methods, or resort to hand-derived derivatives to achieve acceptable performance (Huang et al., 2024).

Mesh-based applications therefore require AD systems that preserve sparsity by construction and support efficient first- and second-order derivative computation without constraining the choice of numerical solver. While prior mesh-oriented AD tools exist (see §2.3), they typically target a narrow derivative order, assume specific solver structures, or fail to scale efficiently to large systems.

AD in mesh-based computations on the GPU is memory-bound, i.e., the primary cost lies not in arithmetic operations but in accessing and updating large, sparse data structures. Modern GPUs provide extremely high bandwidth and low latency in registers and shared memory, but only when computation is organized to exploit locality. Conventional AD systems struggle to do so because their derivative representations induce scattered reads and writes to global memory, preventing effective use of the GPU memory hierarchy. As a result, much of the available bandwidth remains unused.

In this paper, we present a GPU AD system tailored to the sparsity and locality of computation on triangle meshes. Our system exploits this locality to perform differentiation at the level of individual mesh elements. Each gradient, Jacobian, or Hessian contribution depends only on a small, fixed neighborhood, which allows us to keep differentiation entirely within registers or shared memory and limit global memory traffic (Figure 3).

Our system supports efficient computation of first- and second-order derivatives, i.e., gradients, sparse Jacobians, sparse Hessians, and Hessian–vector products. We use forward-mode AD via operator overloading to evaluate per-element derivatives independently, assemble sparse derivatives directly on the GPU, and optionally operate in matrix-free mode. In addition to fixed mesh stencils, the system supports dynamic pairwise interactions that introduce new couplings between elements during execution, requiring updates to the Hessian sparsity pattern. By preallocating sparse structures from mesh connectivity and managing sparsity updates explicitly on the GPU, we avoid dynamic computation graph construction and enable fully parallel execution under both static and dynamically evolving sparsity. Our implementation is open source and available at https://github.com/owensgroup/RXMesh.

With a focus on triangle meshes, we design a system for sparse differentiation on meshes to achieve the following design goals:

(1) Performance: Achieve high performance for the core operations in mesh-based optimization pipelines that require sparse differentiation, including assembling sparse Hessians and Jacobians, evaluating Hessian–vector products, and handling dynamically changing sparsity patterns.

(2) Robustness: Support a broad range of mesh-based operations and energy formulations, including terms defined on vertices, edges, faces, and dynamically generated interactions (e.g., between different bodies), as well as first- and second-order derivatives. The system places no restrictions on mesh quality, accommodating non-manifold meshes and disconnected components.

(3) Interoperability: Enable easy integration with existing high-performance numerical GPU solvers as well as accelerated spatial data structures (e.g., BVHs).

(4) Decoupling specification from execution: Allow users to specify what local computations define their objective independently of how these computations are evaluated, differentiated, and assembled via a simple programming interface. A single problem specification can be reused across multiple evaluation modes (passive evaluation, full differentiation, and matrix-free derivative products) and across different solver pipelines, without modification.

We test our system on a suite of applications including mass-spring cloth simulation, mesh parameterization, manifold optimization, mesh smoothing, ARAP deformation, curl-free polyvector design, and elastic simulation. We compare against widely used general-purpose AD frameworks, including PyTorch, JAX, Warp, EnzymeAD, and Dr.JIT, as well as the domain-specific nonlinear least-squares system Thallo (Mara et al., 2021). Across all benchmarks, our system consistently achieves higher performance for different workloads that involve sparse first- and second-order derivatives.

In summary, this paper presents a GPU system for automatic differentiation on triangle meshes that preserves sparsity and exploits locality by evaluating derivatives at the level of local mesh neighborhoods. The system supports first- and second-order derivatives, explicit sparse gradient/Jacobian/Hessian construction, Hessian-vector products, and dynamic sparsity updates arising from runtime interactions, all within a unified programming model. Across a range of mesh-based optimization and simulation problems, we show that this design substantially reduces differentiation cost and outperforms existing AD frameworks and domain-specific baselines.

2. Related Work

There are several classical and modern techniques for computing derivatives in scientific computing and optimization (Martins and Ning, 2021). Each method offers different trade-offs in terms of accuracy, efficiency, expressiveness, and implementation effort (Kim and Eberle, 2022, 2020).

Manual differentiation relies on deriving and implementing gradient expressions by hand. While this can produce highly optimized code for fixed formulations, it is labor-intensive, error-prone, and difficult to maintain or extend as models evolve. Symbolic differentiation constructs closed-form derivative expressions by manipulating algebraic representations of a program. This approach can yield exact derivatives and enables algebraic simplification, common subexpression elimination, and other global optimizations. However, it scales poorly to large programs and struggles with control flow, loops, and iterative solvers that are ubiquitous in scientific computing. As a result, symbolic methods are typically restricted to small expressions or serve as building blocks within hybrid systems, and their extension to more general programs remains an active area of research (Herholz et al., 2022; Fernández-Fernández et al., 2025; Herholz et al., 2024).

Finite differences approximate derivatives using truncated Taylor expansions. They are simple to implement and applicable to black-box functions, but are sensitive to step-size selection, suffer from truncation and cancellation errors, and scale poorly with input dimensionality. Complex-step differentiation addresses some numerical issues of finite differences by avoiding subtractive cancellation through complex arithmetic. While highly accurate, it still incurs a linear cost in the number of input variables.
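To make the contrast between the two approximations concrete, the following minimal Python sketch compares a forward finite difference with the complex-step formula on a test function of our own choosing (the function `f`, the step sizes, and all helper names are illustrative assumptions, not part of the paper).

```python
import cmath
import math

def f(x):
    # Illustrative test function with a known derivative:
    # f'(x) = exp(x) * (sin(x) + cos(x)).
    return cmath.exp(x) * cmath.sin(x)

def forward_difference(func, x, h):
    # Truncated Taylor expansion; subtracts two nearly equal values,
    # so the result suffers from cancellation for small h.
    return ((func(x + h) - func(x)) / h).real

def complex_step(func, x, h):
    # Perturb along the imaginary axis; there is no subtraction,
    # so h can be made extremely small without cancellation.
    return func(complex(x, h)).imag / h

x0 = 1.0
exact = math.exp(x0) * (math.sin(x0) + math.cos(x0))
fd_err = abs(forward_difference(f, x0, 1e-8) - exact)
cs_err = abs(complex_step(f, x0, 1e-20) - exact)
print(f"finite-difference error: {fd_err:.3e}")
print(f"complex-step error:      {cs_err:.3e}")
```

Both helpers evaluate one input direction per call, which is the linear cost in the number of input variables noted above.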

The focus of our work, automatic differentiation (AD), also known as algorithmic differentiation, applies the chain rule directly to program execution and computes derivatives to machine precision. AD combines the generality of finite differences with the accuracy of symbolic methods, while remaining applicable to programs with complex control flow and iterative structure. These properties have made AD the dominant approach for differentiation in modern scientific and machine learning software.

We categorize existing AD systems into two broad groups. The first consists of general-purpose, domain-agnostic frameworks that can differentiate any computation expressed within their programming model. The second is domain-specific systems designed for geometric computation, and in particular for mesh-based workloads. We also provide an overview of RXMesh (Mahmoud et al., 2021) since our system relies on its data structure.

2.1. Automatic Differentiation (AD)

We begin by briefly reviewing the two main modes of AD as they are central to the rest of the paper. For a more comprehensive treatment, we refer readers to the textbooks by Naumann (2012) and Griewank and Walther (2008).

AD is a family of techniques to compute derivatives of functions expressed as computer programs. AD exploits the fact that any such program is a composition of elementary operations with known derivatives allowing for systematic application of the chain rule. There are two primary modes of AD, i.e., forward and reverse mode.

In forward mode, derivatives propagate from inputs to outputs, computing the directional derivative of a function $f:\mathbb{R}^{n}\to\mathbb{R}^{m}$ along a chosen tangent direction. Forward mode maintains a derivative (or dual) value alongside each intermediate variable during program execution. A common implementation strategy is operator overloading where each arithmetic operation is redefined to also compute and propagate the derivative. This approach aligns naturally with the execution order of the original program where each node carries forward both its value and its local derivative.
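As an illustration of forward mode via operator overloading, the sketch below implements a minimal dual number in Python. It is a didactic toy under our own naming (`Dual`, `directional_derivative`, and the test function `f` are assumptions for illustration) and does not reflect the GPU implementation described in this paper.

```python
import math

class Dual:
    """A value paired with its derivative along one tangent direction."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (zero tangent).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.tan - other.tan)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule, applied alongside the primal multiplication.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

def sin(x):
    # Elementary operation with a known derivative.
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def f(x, y):
    return x * y + sin(x)

def directional_derivative(func, point, tangent):
    # Seed each input with its tangent component and run the program once.
    args = [Dual(p, t) for p, t in zip(point, tangent)]
    return func(*args).tan

df_dx = directional_derivative(f, (2.0, 3.0), (1.0, 0.0))  # equals y + cos(x)
df_dy = directional_derivative(f, (2.0, 3.0), (0.0, 1.0))  # equals x
```

Each call propagates a single tangent direction, so a full gradient of an $n$-input function takes $n$ passes. This is inexpensive when $n$ is the small number of local variables of one mesh element.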

In reverse mode, derivatives are propagated from outputs to inputs, computing the gradient of a scalar-valued function $f:\mathbb{R}^{n}\to\mathbb{R}$ by traversing the computation graph in reverse. Reverse mode requires first recording a computation graph during the forward pass which captures all intermediate variables and dependencies. During the reverse pass, this graph is traversed backward to apply the chain rule and accumulate gradients with respect to the inputs.
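For comparison with forward mode, the following Python sketch records a small computation graph during the forward pass and then traverses it backward. The class `Var` and the function `backward` are hypothetical names chosen for illustration; production systems use far more elaborate tapes.

```python
import math

class Var:
    """A node in the recorded computation graph."""

    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents  # tuples of (parent node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val, ((self, other.val), (other, self.val)))

def sin(x):
    return Var(math.sin(x.val), ((x, math.cos(x.val)),))

def backward(out):
    # Order the graph topologically with a depth-first search.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(out)
    out.grad = 1.0
    # Reverse pass: apply the chain rule from the output back to the inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)   # forward pass records the graph
backward(z)          # one reverse pass yields all input gradients
```

A single reverse pass produces the full gradient, but every intermediate node must stay in memory until the backward traversal, which is the storage cost that a per-element forward-mode design avoids.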

2.2. AD Tools

ADOL-C (Griewank et al., 1996) is a foundational C/C++ library that computes gradients and higher-order derivatives using operator overloading and taping. While broadly applicable, its tape management introduces runtime overhead and limits compiler optimizations, making it less suited for performance-critical tasks. PyTorch (Ansel et al., 2024) and JAX (Bradbury et al., 2018) are more recent AD systems that are widely used in machine learning and numerical computing with efficient support for dense tensor operations. However, their execution models are less effective for sparse, irregular structures or fine-grained control flow typical in scientific computing and geometric data processing.

Enzyme (Moses et al., 2021, 2022) performs AD at the LLVM IR level, enabling compiler-level optimizations that reduce memory and execution overhead—especially in reverse mode. This makes it attractive for integrating AD into existing high-performance codebases. Enoki (Jakob, 2019) is a C++17 library that supports forward and reverse-mode AD with vectorized execution across CPU and GPU. Dr.JIT (Jakob et al., 2022), its successor, compiles high-level Python/C++ code into optimized machine code, enabling efficient differentiable rendering. While effective in rendering contexts, both systems are limited to first-order derivatives and do not exploit the sparsity or locality common in mesh-based computations. Our system builds on the insights behind these tools, adapting them to better suit sparse workloads in mesh processing, leading to improved performance in our target domain.

2.3. Mesh AD Tools

Herholz et al. (2022) proposed a system that applies symbolic differentiation to unoptimized C++ code operating on sparse data. Their approach constructs a global expression graph across the mesh, eliminates redundant subexpressions, and generates vectorized, parallel kernels for CPU or GPU execution. While this yields highly optimized code, it incurs high memory usage and long compilation times. To address these issues, Herholz et al. (2024) introduced a refinement where users define symbolic expressions for a single mesh element. The system compiles these elementwise kernels independently of the global mesh, significantly reducing memory usage and compile times.

TinyAD (Schmidt et al., 2022) is a C++ library designed for sparse optimization. It computes gradients and Hessians by differentiating small, per-element subproblems. Users define local energy terms and the system applies forward-mode AD to compute first- and second-order derivatives. While structurally similar to our method, TinyAD runs only on CPUs and uses OpenMP for parallelism. Our work builds on TinyAD’s structure but is designed explicitly for GPUs by optimizing for memory locality and parallel throughput.

Opt (Devito et al., 2017) is a DSL for nonlinear least squares problems. It uses symbolic differentiation at the IR level and generates first-order optimization code. Thallo (Mara et al., 2021) extends Opt by improving scheduling and memory layout. Unlike our system, these frameworks focus on first-order optimization for nonlinear least squares problems and do not support explicit sparse Hessian construction or Hessian-vector products. Warp (Macklin, 2022) is a Python-based framework for high-performance spatial computing. Users write simulation kernels in Python which are JIT-compiled for CPU or GPU. Warp supports reverse-mode AD by generating backward kernels enabling simulation differentiation and integration with ML pipelines.

Relation to Herholz et al. (2024): Our system and Herholz et al. (2024) address similar challenges in sparse differentiation for mesh-based computation, including support for second-order derivatives and dynamically changing Hessian sparsity induced by interacting mesh elements. Both systems aim to achieve high performance, but differ in how this goal is pursued. Herholz et al. (2024) is based on symbolic differentiation, constructing and optimizing global expression graphs to eliminate redundant computations before generating parallel code via their symbolic backend (Herholz et al., 2022). In contrast, our approach prioritizes memory locality and bandwidth efficiency, organizing differentiation to occur at the finest granularity possible and confining most computation to registers and shared memory on the GPU. While Herholz et al. (2024) focuses primarily on expression-level optimization, our system emphasizes aggressive memory optimizations and ensures that all stages of differentiation, including collision handling and sparse Hessian updates, execute entirely on the GPU. The two approaches are complementary and our system could serve as a backend for symbolic frontends such as Herholz et al. (2024), combining expression-level optimization with GPU-efficient sparse differentiation.

In summary, while several mesh-based AD tools support GPU execution, many overlook the importance of memory locality. Some focus solely on symbolic simplification (Herholz et al., 2022, 2024), or lack higher-order derivatives (e.g., Opt, Thallo, Warp), and some run only on the CPU (Schmidt et al., 2022). In contrast, our method treats the GPU as a first-class target and optimizes all AD operations with an emphasis on memory locality and execution efficiency.

2.4. RXMesh Overview

RXMesh (Mahmoud et al., 2021, 2025) is a GPU system for triangle mesh processing that supports both static meshes and dynamic workloads that modify connectivity at runtime. Its core idea is to partition the mesh into small patches sized to fit the GPU memory hierarchy well so that most local computation can be carried out from fast on-chip shared memory rather than global memory. To preserve locality near patch boundaries, RXMesh augments each patch with ghost elements, called ribbons, that cache neighboring out-of-patch mesh data. Each patch is encoded independently using compact sparse matrix representations of face-edge and edge-vertex incidence. Computation is organized at the patch level, with one CUDA block assigned to a patch so that threads cooperate on local queries with reduced divergence and improved load balance.

A ribbon element is treated as an indirection to its owner patch rather than as a locally stored element. Any access to its connectivity or attributes is resolved by first identifying the owner patch and the element’s local index within that patch, and then reading the data from the owner’s storage. This indirection is realized through hash tables that store enough information to recover both pieces of information for each ribbon element.
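The ribbon indirection can be pictured with the following schematic Python sketch. It is a conceptual model only: the class `Patch`, the function `read_attribute`, and the use of a Python dictionary in place of a GPU hash table are all assumptions made for illustration and do not mirror the RXMesh data layout.

```python
class Patch:
    def __init__(self, owned_values, ribbon_map):
        # Attribute values stored locally for elements this patch owns.
        self.owned_values = owned_values
        # Stand-in for the hash table:
        # ribbon local id -> (owner patch id, local id inside the owner).
        self.ribbon_map = ribbon_map

def read_attribute(patches, patch_id, local_id):
    patch = patches[patch_id]
    if local_id in patch.ribbon_map:
        # Ribbon element: forward the access to the owner's storage.
        owner_patch, owner_local = patch.ribbon_map[local_id]
        return patches[owner_patch].owned_values[owner_local]
    # Owned element: read directly from local storage.
    return patch.owned_values[local_id]

# Two toy patches; local id 2 in each is a ribbon element owned by the other.
patches = {
    0: Patch({0: 10.0, 1: 11.0}, {2: (1, 0)}),
    1: Patch({0: 20.0, 1: 21.0}, {2: (0, 1)}),
}
ribbon_value = read_attribute(patches, 0, 2)  # resolved in patch 1
owned_value = read_attribute(patches, 0, 1)   # read locally
```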

3. System Overview

At a high level, our system provides a way for the user to define mesh-based objectives by composing many local terms while the system takes responsibility for evaluating, differentiating, and assembling these terms efficiently on the GPU. From the user’s perspective, an objective is built incrementally by specifying local energy or constraint contributions over mesh elements or element neighborhoods. These terms may depend on mesh adjacency information or on dynamically generated interactions, e.g., proximity-based or collision-driven couplings. The user expresses only the local computation, which can then be evaluated passively, differentiated to produce gradients and sparse Hessians and Jacobians, or used in matrix-free form for Hessian-vector products.
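The matrix-free mode can be sketched as follows: each local term gathers the entries of the input vector it depends on, multiplies them by its small dense Hessian, and scatter-adds the result, so the global Hessian is never stored. The Python example below uses a one-dimensional spring chain with analytic local Hessians purely for illustration; the names `spring_hessian` and `hessian_vector_product` are hypothetical, and in the actual system the local derivatives come from AD rather than closed-form expressions.

```python
def spring_hessian(k):
    # Dense 2x2 Hessian of the local term 0.5 * k * (x_i - x_j)^2.
    return [[k, -k], [-k, k]]

def hessian_vector_product(edges, stiffness, v):
    out = [0.0] * len(v)
    for (i, j), k in zip(edges, stiffness):
        h = spring_hessian(k)
        local_v = (v[i], v[j])          # gather the local entries of v
        dofs = (i, j)
        for r in range(2):
            # Local dense product, scatter-added into the global result.
            out[dofs[r]] += h[r][0] * local_v[0] + h[r][1] * local_v[1]
    return out

edges = [(0, 1), (1, 2), (2, 3)]
stiffness = [1.0, 2.0, 3.0]
v = [1.0, 0.0, -1.0, 2.0]
hv = hessian_vector_product(edges, stiffness, v)
```

On a GPU the loop over elements would run in parallel and the scatter-add would need atomic or otherwise conflict-free accumulation; the serial loop here only conveys the data flow.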

The novelty of our system lies in how this specification is realized into efficient GPU programs. While most of the computations required for derivative evaluation are local to individual mesh elements, existing AD systems force users into an uncomfortable choice: (1) rely on general-purpose frameworks that fail to preserve sparsity and therefore make second-order derivatives expensive, or (2) hand-derive gradients and Hessians tailored to a specific problem.

To bridge the gap between local problem descriptions and high-performance implementations, our system transforms the user’s specification into efficient GPU code. Our system analyzes the structure of each local term to determine its stencil, dimensionality, and derivative requirements. We use this information to define how local computations map to global degrees of freedom and to preallocate the global data structures needed for sparse gradients, Jacobians, and Hessians. At runtime, local terms are evaluated independently and in parallel across the mesh. Each term computes its contribution using a compact local state, produces dense local derivatives, and contributes these results to global sparse structures (Figure 3), or applies them directly in matrix-free form. When interactions introduce new couplings during execution, the system updates sparsity patterns on the GPU without rebuilding global computation graphs or transferring data between the CPU and GPU.
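To illustrate how a sparsity pattern can be derived from connectivity before any derivative is evaluated, the Python sketch below builds a CSR pattern for a Hessian with one scalar unknown per vertex and energy terms defined on faces. The function name `hessian_pattern_csr` and the one-unknown-per-vertex simplification are assumptions for illustration, not a description of the system's internal data structure.

```python
def hessian_pattern_csr(num_vertices, faces):
    # Row v holds every vertex that shares a face with v, plus v itself.
    rows = [{v} for v in range(num_vertices)]
    for face in faces:
        for a in face:
            rows[a].update(face)
    # Flatten into CSR arrays: row offsets and sorted column indices.
    row_ptr, col_idx = [0], []
    for cols in rows:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx

# Two triangles sharing the edge (1, 2).
faces = [(0, 1, 2), (1, 3, 2)]
row_ptr, col_idx = hessian_pattern_csr(4, faces)
```

Vertices 0 and 3 never share a face, so entries (0, 3) and (3, 0) are absent from the pattern. Once the pattern is fixed, each local Hessian block can be written into preallocated slots without any further allocation.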

Internally, we build our system around patch-level execution, where the mesh is partitioned into small, independent patches that can be processed cooperatively within a GPU thread block. Each patch fits in shared memory and provides a bounded working set for both primal evaluation and differentiation. This allows all intermediates required for forward-mode AD, including temporary dual values and local Jacobians or Hessians, to be generated and consumed locally without spilling to global memory. We build on RXMesh’s (Mahmoud et al., 2021) patch-based execution model to achieve this locality but extend it substantially to support automatic differentiation, dynamic interaction terms, and sparse derivative assembly. While patch-based execution has been explored for efficient mesh processing (Mahmoud et al., 2025; Yu et al., 2022), our work is the first to use patches for AD, demonstrating that patch locality is a key factor in efficient sparse derivative computation on GPUs.

4. Programming Model

We target mesh-based problems in which the objective decomposes into a sum of local functions defined over mesh elements. Such problems arise in simulation, optimization, and geometric data processing, where energies, constraints, or residuals are associated with vertices, edges, faces, or small neighborhoods thereof. A key property of these problems is that they are partially separable (Nocedal and Wright, 2006), i.e., each local term depends only on a small subset of the global degrees of freedom.

4.1. Problem Overview

Scalar-valued objectives.

We first consider a scalar-valued energy function $F:\mathbb{R}^{n}\rightarrow\mathbb{R}$ of the form

$$F(x)=\sum_{j\in\mathbb{E}}f_{j}(x_{j}),\tag{1}$$

where $\mathbb{E}$ denotes a set of mesh elements (e.g., vertices, edges, or faces), and each $f_{j}:\mathbb{R}^{k_{j}}\rightarrow\mathbb{R}$ is a localized energy term associated with element $j$ . Here, $x\in\mathbb{R}^{n}$ encodes the global degrees of freedom, while $x_{j}\in\mathbb{R}^{k_{j}}$ collects only those variables that influence $f_{j}$ . The size $k_{j}$ is determined by the element’s local neighborhood and is typically small and bounded (e.g., three vertices for a face, two for an edge).

This local dependency can be expressed using a binary selection matrix $S_{j}\in\{0,1\}^{k_{j}\times n}$ such that $x_{j}=S_{j}x.$ Differentiation distributes over summation, allowing each local term to be differentiated independently. Specifically, we compute per-element gradients $g_{j}\in\mathbb{R}^{k_{j}}$ and Hessians $H_{j}\in\mathbb{R}^{k_{j}\times k_{j}}$ , which are assembled into global structures via

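$$g=\sum_{j\in\mathbb{E}}S_{j}^{\top}g_{j}\quad\text{and}\quad H=\sum_{j\in\mathbb{E}}S_{j}^{\top}H_{j}S_{j}.\tag{2}$$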
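In practice the selection matrices $S_{j}$ are never formed: multiplying by $S_{j}^{\top}$ amounts to a scatter-add into the global indices of element $j$. The Python sketch below shows this assembly for a toy one-dimensional spring chain. The local gradients and Hessians are written analytically for brevity (the system described here obtains them by forward-mode AD), and the names `local_spring` and `assemble` as well as the dictionary-based sparse storage are illustrative assumptions.

```python
from collections import defaultdict

def local_spring(xi, xj, k):
    # Local term f_j = 0.5 * k * (xi - xj)^2 with its gradient g_j
    # and dense Hessian H_j over the two local variables.
    d = xi - xj
    g = [k * d, -k * d]
    H = [[k, -k], [-k, k]]
    return g, H

def assemble(x, elements, stiffness):
    g = [0.0] * len(x)
    H = defaultdict(float)  # sparse Hessian keyed by (row, col)
    for dofs, k in zip(elements, stiffness):
        gj, Hj = local_spring(x[dofs[0]], x[dofs[1]], k)
        for a, ia in enumerate(dofs):
            g[ia] += gj[a]                   # contribution of S_j^T g_j
            for b, ib in enumerate(dofs):
                H[(ia, ib)] += Hj[a][b]      # contribution of S_j^T H_j S_j
    return g, dict(H)

x = [0.0, 1.0, 3.0]
elements = [(0, 1), (1, 2)]   # global indices of each term's local variables
stiffness = [1.0, 2.0]
g, H = assemble(x, elements, stiffness)
```

Only pairs of variables that appear together in some local term receive a Hessian entry, so the sparsity of $H$ follows directly from the element list.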

Vector-valued functions and sparse Jacobians.

In addition to scalar objectives, our programming model also supports vector-valued functions $F:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ , which commonly arise in nonlinear least-squares problems and constraint formulations. We assume $F$ decomposes as a sum of local vector-valued terms, as in Equation 1. Here, each $f_{j}:\mathbb{R}^{k_{j}}\rightarrow\mathbb{R}^{m_{j}}$ produces a small vector-valued local contribution associated with element $j$ . The total output dimension is $m=\sum_{j}m_{j}$ .

Differentiating $F$ yields a sparse Jacobian $J\in\mathbb{R}^{m\times n}$ . Each local term contributes a dense local Jacobian $J_{j}\in\mathbb{R}^{m_{j}\times k_{j}}$ , which is assembled into the global Jacobian via

$$J=\sum_{j\in\mathbb{E}}P_{j}^{\top}J_{j}S_{j},\tag{3}$$

where $P_{j}\in\{0,1\}^{m_{j}\times m}$ is a binary selection matrix that maps the $m_{j}$ local residuals of term $j$ to their locations in the global output vector. As in the scalar case, each $J_{j}$ is small and dense while the global Jacobian is sparse with a structure determined by mesh connectivity.
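The Jacobian assembly follows the same pattern, with $P_{j}$ selecting rows and $S_{j}$ selecting columns. In the Python sketch below, each edge contributes one residual ($m_{j}=1$) that depends on two variables ($k_{j}=2$). The residual, the function names `local_residual` and `assemble_jacobian`, and the dictionary-based sparse storage are illustrative assumptions, and the local Jacobian is written analytically rather than obtained through AD.

```python
def local_residual(xi, xj, rest):
    # One residual per edge: squared length minus squared rest length.
    d = xi - xj
    r = [d * d - rest * rest]      # m_j = 1 local outputs
    J = [[2.0 * d, -2.0 * d]]      # dense m_j x k_j local Jacobian
    return r, J

def assemble_jacobian(x, edges, rest_lengths):
    residual, J, row0 = [], {}, 0
    for (i, j), rest in zip(edges, rest_lengths):
        rj, Jj = local_residual(x[i], x[j], rest)
        for a in range(len(rj)):
            residual.append(rj[a])               # P_j^T places local rows
            for b, col in enumerate((i, j)):     # S_j places local columns
                J[(row0 + a, col)] = Jj[a][b]
        row0 += len(rj)                          # next term's first global row
    return residual, J

x = [0.0, 1.0, 3.0]
edges = [(0, 1), (1, 2)]
rest_lengths = [1.0, 1.0]
residual, J = assemble_jacobian(x, edges, rest_lengths)
```

Because each term owns its own rows of the output vector, the blocks never overlap and plain assignment suffices, unlike the Hessian case where contributions from neighboring elements must be summed.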

Bibliography

Ansel et al. (2024) Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary De Vito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias
Blondel et al. (2022) Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. 2022. Efficient and Modular Implicit Differentiation. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS ’22, Vol. 35). Curran Associates, Inc., Article 378, 13 pages. doi: 10.5555/3600270.3600648
Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/jax-ml/jax
Brisson (1989) Erik Brisson. 1989. Representing Geometric Structures in d Dimensions: Topology and Order. In Proceedings of the Fifth Annual Symposium on Computational Geometry (Saarbruchen, West Germany) (SCG ’89). Association for Computing Machinery, New York, NY, USA, 218–227. doi: 10.1145/73833.73858
Devito et al. (2017) Zachary Devito, Michael Mara, Michael Zollhöfer, Gilbert Bernstein, Jonathan Ragan-Kelley, Christian Theobalt, Pat Hanrahan, Matthew Fisher, and Matthias Niessner. 2017. Opt: A Domain Specific Language for Non-Linear Least Squares Optimization in Graphics and Imaging. ACM Trans. Graph. 36, 5, Article 171 (Oct. 2017), 27 pages. doi: 10.1145/3132188
Diamanti et al. (2015) Olga Diamanti, Amir Vaxman, Daniele Panozzo, and Olga Sorkine-Hornung. 2015. Integrable PolyVector Fields. ACM Trans. Graph. 34, 4, Article 38 (July 2015), 12 pages. doi: 10.1145/2766906
Fernández-Fernández et al. (2025) José Antonio Fernández-Fernández, Fabian Löschner, Lukas Westhofen, Andreas Longva, and Jan Bender. 2025. SymX: Energy-based Simulation from Symbolic Expressions. ACM Trans. Graph. 45, 1, Article 5 (Oct. 2025), 19 pages. doi: 10.1145/3764928