Convergence of Two-Layer Regression with Nonlinear Units
Yichuan Deng, Zhao Song, Shenghao Xie

TL;DR
This paper analyzes the convergence of a two-layer regression model with nonlinear units (softmax and ReLU), providing theoretical guarantees and a greedy algorithm based on an approximate Newton method.
Contribution
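It derives a closed-form Hessian for the softmax-ReLU regression loss, proves that this Hessian is Lipschitz continuous and PSD under certain assumptions, and proposes a greedy algorithm based on an approximate Newton method with convergence guarantees.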
Findings
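The Hessian of the loss function is derived in closed form.
Under certain assumptions, the Hessian is shown to be Lipschitz continuous and positive semidefinite (PSD).
The proposed greedy algorithm converges in distance to the optimal solution; under a relaxed Lipschitz condition, convergence is shown in loss value.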
Abstract
Large language models (LLMs), such as ChatGPT and GPT4, have shown outstanding performance in many tasks in human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are the key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the PSDness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Finally, we relax the Lipschitz condition and prove convergence in the sense of loss value.
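The abstract describes a Newton-type iteration whose analysis relies on a PSD Hessian, but this summary does not state the exact loss or the update rule. The following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical toy objective that composes a softmax layer with a ReLU layer and adds an L2 regularizer. The matrices `A` and `B`, the target `b`, the weight `LAM`, the finite-difference derivatives, the eigenvalue shift used to force a positive definite Hessian, and the backtracking step are all illustrative choices that do not come from the paper.

```python
import math

def softmax(z):
    m = max(z)                                  # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def relu(z):
    return [max(0.0, v) for v in z]

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

# Hypothetical toy data (assumptions, not taken from the paper).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # first-layer matrix, 3 x 2
B = [[1.0, -0.5, 0.0], [0.0, 1.0, -0.5]]       # second-layer matrix, 2 x 3
b = [0.3, 0.1]                                 # regression target
LAM = 0.1                                      # L2 regularization weight

def loss(x):
    pred = relu(matvec(B, softmax(matvec(A, x))))
    fit = 0.5 * sum((p - t) ** 2 for p, t in zip(pred, b))
    return fit + 0.5 * LAM * sum(v * v for v in x)

def grad(x, h=1e-5):
    # Central finite differences stand in for a closed-form gradient.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((loss(xp) - loss(xm)) / (2 * h))
    return g

def hessian(x, h=1e-4):
    # Finite-difference Hessian, symmetrized; a closed form would replace this.
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        gp, gm = grad(xp), grad(xm)
        for j in range(n):
            H[i][j] = (gp[j] - gm[j]) / (2 * h)
    return [[0.5 * (H[i][j] + H[j][i]) for j in range(n)] for i in range(n)]

def newton_direction(H, g, floor=1e-3):
    # 2 x 2 case only: shift the spectrum so the smallest eigenvalue is >= floor,
    # then solve (H + shift * I) d = -g with Cramer's rule.
    a, off, c = H[0][0], H[0][1], H[1][1]
    lam_min = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + off ** 2)
    shift = max(0.0, floor - lam_min)
    a, c = a + shift, c + shift
    det = a * c - off * off
    return [(-g[0] * c + g[1] * off) / det, (-g[1] * a + g[0] * off) / det]

def approx_newton(x0, iters=20):
    x = list(x0)
    history = [loss(x)]
    for _ in range(iters):
        d = newton_direction(hessian(x), grad(x))
        t = 1.0
        while t > 1e-8:                         # backtracking: accept only non-increasing steps
            cand = [xi + t * di for xi, di in zip(x, d)]
            if loss(cand) <= history[-1]:
                x = cand
                break
            t *= 0.5
        history.append(loss(x))
    return x, history

x_star, history = approx_newton([1.0, -1.0])
```

Because a step is accepted only when it does not increase the loss, `history` is non-increasing by construction. This illustrates convergence in loss value for the toy problem only and says nothing about the rates proved in the paper.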
Taxonomy
Topics: Machine Learning and ELM · Machine Learning and Data Classification · Advanced Bandit Algorithms Research
Methods: Softmax
