Linear programming using diagonal linear networks

Haoyue Wang; Promit Ghosal; Rahul Mazumder

arXiv:2310.02535·math.OC·October 5, 2023

Open Access 3 Reviews

TL;DR

This paper introduces a novel framework using diagonal linear networks trained with gradient descent to solve entropically regularized linear programming problems, highlighting the impact of initialization on regularization strength and demonstrating linear convergence.

Contribution

It presents the first comprehensive framework leveraging diagonal neural networks for linear programming, analyzing convergence and the role of initialization in regularization.

Findings

01. Training diagonal networks yields solutions to regularized LP problems.

02. Convergence occurs at a linear rate under mild assumptions.

03. Initialization influences the regularization strength.
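
A rough way to see why initialization can set the regularization strength is sketched below for continuous-time gradient flow. The setup is an illustrative assumption, not taken from this summary and not necessarily the authors' exact formulation: an LP $\min c^\top x$ subject to $Ax = b$, $x \ge 0$, the parametrization $x = u \circ u$, the loss $\tfrac{1}{2}\|Ax - b\|_2^2$, and the initialization $u(0) = e^{-c/(2\lambda)}$ with an assumed parameter $\lambda > 0$.

```latex
\begin{aligned}
\dot u(t) &= -2\, u(t) \circ A^\top r(t),
  \qquad r(t) = A x(t) - b, \quad x(t) = u(t) \circ u(t), \\
\log x(t) &= \log x(0) - 4 A^\top \int_0^t r(s)\, ds, \\
\lambda \log x(t) &= -c + A^\top \mu(t),
  \qquad \mu(t) = -4\lambda \int_0^t r(s)\, ds
  \quad \text{(using } x(0) = e^{-c/\lambda}\text{)}.
\end{aligned}
% If x(t) -> x^* with A x^* = b and mu(t) -> mu^*, then
%   c + \lambda \log x^* = A^\top \mu^*,  A x^* = b,
% which are the KKT conditions of
%   \min_x  c^\top x + \lambda \sum_j ( x_j \log x_j - x_j )  s.t.  A x = b.
```

Under these assumptions the scale $\lambda$ chosen at initialization reappears as the weight on the entropy term, so a smaller $\lambda$ corresponds to weaker regularization.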

Abstract

Linear programming has played a crucial role in shaping decision-making, resource allocation, and cost reduction in various domains. In this paper, we investigate the application of overparametrized neural networks and their implicit bias in solving linear programming problems. Specifically, our findings reveal that training diagonal linear networks with gradient descent, while optimizing the squared $L_{2}$-norm of the slack variable, leads to solutions for entropically regularized linear programming problems. Remarkably, the strength of this regularization depends on the initialization used in the gradient descent process. We analyze the convergence of both discrete-time and continuous-time dynamics and demonstrate that both exhibit a linear rate of convergence, requiring only mild assumptions on the constraint matrix. For the first time, we introduce a comprehensive framework for…
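
The idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' algorithm: the parametrization `x = u * u`, the initialization `u0 = exp(-c / (2 * lam))`, the function name `diag_net_lp`, and the step size are all assumptions chosen for illustration. The toy problem is an LP over the probability simplex, where the entropically regularized solution is known in closed form (a softmax of `-c / lam`), so the output can be checked.

```python
import math


def matvec(A, x):
    """Return A @ x for a list-of-rows matrix A."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Return A.T @ r for a list-of-rows matrix A."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def diag_net_lp(A, b, c, lam=1.0, step=0.1, iters=500):
    """Gradient descent on 0.5 * ||A (u*u) - b||^2 over u (hypothetical helper).

    Assumption: the cost c and the regularization scale lam enter only
    through the initialization u0 = exp(-c / (2 * lam)).
    """
    u = [math.exp(-cj / (2.0 * lam)) for cj in c]
    for _ in range(iters):
        x = [uj * uj for uj in u]
        r = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # slack A x - b
        g = rmatvec(A, r)                                 # A^T r
        # Gradient of the loss with respect to u is 2 * u * (A^T r).
        u = [uj - step * 2.0 * uj * gj for uj, gj in zip(u, g)]
    return [uj * uj for uj in u]


# Toy LP: minimize c^T x subject to sum(x) = 1, x >= 0.
A = [[1.0, 1.0, 1.0]]
b = [1.0]
c = [0.0, 1.0, 2.0]
lam = 1.0

x = diag_net_lp(A, b, c, lam=lam)

# Closed-form entropic solution on the simplex: softmax(-c / lam).
z = [math.exp(-cj / lam) for cj in c]
expected = [zj / sum(z) for zj in z]
```

In this single-constraint example every coordinate of `u` is rescaled by the same factor at each step, so `x` stays proportional to `exp(-c / lam)` and the iterates land on the softmax solution. For general constraint matrices the discrete-time iterates need not match the entropic solution exactly, and the sketch makes no claim about that case.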

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

This study's approach, reparameterizing GD for linear programs, offers a novel perspective on the implicit bias of GD. The quality is demonstrated through rigorous mathematical analyses, which include bounding the iterates of the algorithm and characterizing the limit points of the convergence. Clarity is another strength, with the paper presenting its methodology and findings in a structured and understandable manner.

Weaknesses

- A notable weakness is the limited scope of the experimental setup: the simulation relies on isotropic Gaussian features, which may not be representative of real datasets that often contain features with varying scales and correlations. Moreover, the paper does not discuss the impact of non-Gaussian noise or different initialization schemes, which could potentially affect the generalization of the results.
- Can the authors provide some results on real-world benchmarks? If conditions permit,

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

The theoretical results are new and elegant; I am not aware of previous results connecting linear programming and diagonal linear networks.

Weaknesses

- Sections 2.2 and 2.3 are a little bit scientifically "loose": they draw some connections with other methods, but I am not sure there are proper theoretical results that can be extracted from these parts.
- Experiments. I understand this is a theoretical paper, but I think the paper would have more impact with more extensive experiments. The current experimental section illustrates the linear convergence of the algorithm and has one comparison to mirror descent. Maybe the authors could provid

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The extension of the previous results [1,2] on reparametrized gradient descent/flow under minimal assumptions on data and macroscopic choice of learning rate is a nontrivial result. The analysis under this generality introduces additional complexity, but this work successfully establishes (linear) convergence for this setting. [1] Woodworth, B.E., Gunasekar, S., Lee, J., Moroshko, E., Savarese, P.H., Golan, I., Soudry, D., & Srebro, N. (2019). Kernel and Rich Regimes in Overparametrized Models.

Weaknesses

The presentation can be further improved in several ways to enhance clarity and readability:
* Firstly, I could not follow how the similarity between Algorithm 1 and the Sinkhorn algorithm is used in the paper.
* Additionally, to the best of my knowledge, [1] proves a result similar to the one in the manuscript, also in a fairly general setting.

I think it would be helpful for the readers if the authors could elaborate on these points more in their revised manuscript. Another aspect that req

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and Algorithms