Sobolev Training of End-to-End Optimization Proxies
Andrew W. Rosemberg, Joaquim Dias Garcia, Russell Bent, Pascal Van Hentenryck

TL;DR
This paper introduces Sobolev training for optimization proxies, integrating solver sensitivities to improve approximation accuracy in both supervised and self-supervised settings, with significant empirical improvements on power flow and portfolio tasks.
Contribution
It develops a Sobolev training paradigm that incorporates derivative information into optimization proxies, enhancing their accuracy and reliability in large-scale optimization problems.
Findings
Supervised Sobolev training reduces mean squared error by up to 56%.
It reduces the median worst-case constraint violation by up to 400%.
Self-supervised Sobolev training halves the optimality gap in certain portfolio tasks.
Abstract
Optimization proxies - machine learning models trained to approximate the solution mapping of parametric optimization problems in a single forward pass - offer dramatic reductions in inference time compared to traditional iterative solvers. This work investigates the integration of solver sensitivities into such end-to-end proxies via a Sobolev training paradigm and does so in two distinct settings: (i) fully supervised proxies, where exact solver outputs and sensitivities are available, and (ii) self-supervised proxies that rely only on the objective and constraint structure of the underlying optimization problem. By augmenting the standard training loss with directional derivative information extracted from the solver, the proxy aligns both its predicted solutions and local derivatives with those of the optimizer. Under Lipschitz continuity assumptions on the true solution mapping,…
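The supervised loss described in the abstract (a value term plus a directional-derivative term) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the toy problem (an unconstrained quadratic with Q = diag(2, 4), whose solution and sensitivities are known in closed form), the two-layer tanh proxy, the weight `lam`, and the helper names `toy_solver`, `proxy`, `proxy_jvp`, and `sobolev_loss`. The paper's actual architecture, loss weighting, and masking strategy may differ.

```python
import math
import random

def matvec(M, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_solver(p):
    """Hypothetical stand-in for a solver with sensitivities.

    Solves argmin_x 0.5 * x'Qx - p'x with Q = diag(2, 4), so
    x*(p) = Q^{-1} p and the Jacobian dx*/dp = Q^{-1}.
    """
    x_star = [p[0] / 2.0, p[1] / 4.0]
    jac_star = [[0.5, 0.0], [0.0, 0.25]]
    return x_star, jac_star

def proxy(params, p):
    """Two-layer tanh network mapping parameters p to a predicted solution."""
    W1, b1, W2, b2 = params
    h = [math.tanh(z + b) for z, b in zip(matvec(W1, p), b1)]
    return [y + b for y, b in zip(matvec(W2, h), b2)]

def proxy_jvp(params, p, v):
    """Directional derivative of the proxy at p along v (chain rule by hand)."""
    W1, b1, W2, _ = params
    h = [math.tanh(z + b) for z, b in zip(matvec(W1, p), b1)]
    dh = [(1.0 - hi * hi) * d for hi, d in zip(h, matvec(W1, v))]
    return matvec(W2, dh)

def sobolev_loss(params, p, x_star, jac_star, v, lam=1.0):
    """Value mismatch plus lam times the directional-derivative mismatch."""
    value_term = sq_dist(proxy(params, p), x_star)
    deriv_term = sq_dist(proxy_jvp(params, p, v), matvec(jac_star, v))
    return value_term + lam * deriv_term

# Randomly initialised proxy with 3 hidden units (illustrative sizes only).
rng = random.Random(0)
hidden = 3
params = (
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)],
    [0.0] * hidden,
    [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(2)],
    [0.0, 0.0],
)

p = [1.0, -2.0]   # problem parameters
v = [0.6, 0.8]    # unit direction along which derivatives are compared
x_star, jac_star = toy_solver(p)

value_only = sobolev_loss(params, p, x_star, jac_star, v, lam=0.0)
joint = sobolev_loss(params, p, x_star, jac_star, v, lam=1.0)
print("value-only loss:", value_only)
print("joint Sobolev loss:", joint)
```

Comparing along a direction `v` rather than matching the full Jacobian keeps the derivative term to a single Jacobian-vector product per sample. In practice an automatic-differentiation framework would supply that product instead of the hand-written chain rule used here.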
Peer Reviews
Decision: Submitted to ICLR 2026
1. This is one of the first works in ML for optimization that uses Sobolev training techniques. I consider this an innovation for the research field. The sparse masking technique is novel and practically impactful. 2. Theoretical analysis shows that it helps reduce sample complexity and improve generalization. 3. Empirical evaluation on two real-world domains demonstrates the effectiveness of the framework.
1. Jacobian computation in the framework adds overhead during training. The authors discuss how the overhead could be reduced, but it is still unclear how it compares to the benchmark method. 2. The results for portfolio optimization are not promising from a practical point of view, since most funds and asset management firms in the real world operate under tight risk limits in their portfolio management. Overall, portfolio optimization is not the best use case for studying optimization proxies,
- Incorporating solver sensitivities into the loss provides a conceptually intuitive way to teach the network about local geometric structure. The proposed sparse masking strategy is simple, effective, and well-motivated: it reduces memory usage, mitigates gradient conflicts, and ablation studies demonstrate that high sparsity levels (keeping only 5–25% of entries) yield the best performance. - Experimental results show consistent reductions in mean squared error and, notably, in worst-case inf
- While the paper provides bounds for value-only, Jacobian-only, and joint Sobolev training individually, there is no theoretical guarantee that joint Sobolev training consistently achieves a smaller error gap than value-only training. - Although Sobolev training substantially improves MSE and infeasibility metrics compared to the benchmark, the opposite trend is observed for the optimality gap. This suggests that Sobolev may produce solutions that are close to optimal in Euclidean space but no
The paper is conceptually original in combining Sobolev training with the learning of end-to-end optimization proxies. This bridges a clear gap between differentiable optimization and learning-based surrogate modeling. The idea of using solver sensitivities or implicit KKT derivatives as gradient supervision is elegant and well-motivated. On the technical side, the theoretical analysis is rigorous and clearly articulated. The authors carefully state the regularity assumptions (LICQ, SOSC) and d
While the paper is theoretically sound and conceptually elegant, its main limitation lies in the modest empirical improvement. The quantitative gains over value-based baselines are small, and the claimed advantage in improving feasibility is not clearly demonstrated. The authors emphasize feasibility as a key benefit, yet the current metrics only partially capture this aspect. If feasibility improvement is indeed the main contribution, the paper would benefit from more comprehensive metrics and
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms · Advanced Optimization Algorithms Research · Model Reduction and Neural Networks
Methods: Sparse Evolutionary Training
