Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression

Zhihang Li; Zhao Song; Zifan Wang; Junze Yin

arXiv:2311.15390·cs.LG·November 28, 2023·1 cites

TL;DR

This paper analyzes the local convergence of an approximate Newton method for training a two-layer nonlinear regression model with a softmax-activated first layer, providing theoretical guarantees and complexity analysis.

Contribution

The paper analyzes a two-layer regression model whose first layer is activated by softmax, and it establishes local convergence guarantees for an approximate Newton method on that model.

Findings

01

The Hessian of the loss function is positive definite and Lipschitz continuous.

02

The algorithm converges to an $\epsilon$-approximate minimizer in $O(\log(1/\epsilon))$ iterations.

03

Each iteration requires $\tilde{O}(\text{nnz}(C) + d^{\omega})$ time.
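
To make the $O(\log(1/\epsilon))$ iteration count in finding 02 concrete, here is a minimal sketch of an approximate Newton iteration on a one-dimensional toy problem. It is not the paper's loss or algorithm. The objective `f(x) = log(1 + exp(x)) + 0.5 * x**2`, the function names, and the constant `overestimate` (which replaces the true second derivative with a value 1.25 times larger) are all assumptions chosen for illustration. Because the curvature estimate stays within a constant factor of the true one, the error shrinks by a roughly constant factor per step, so the number of steps grows about linearly in $\log(1/\epsilon)$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grad(x):
    # First derivative of the toy objective f(x) = log(1 + exp(x)) + 0.5 * x**2.
    return sigmoid(x) + x


def hess(x):
    # Exact second derivative of the toy objective; always in [1, 1.25].
    s = sigmoid(x)
    return s * (1.0 - s) + 1.0


def approx_newton(x0, eps, overestimate=1.25, max_iter=1000):
    """Newton-type iteration using an inexact curvature estimate.

    The exact second derivative is multiplied by `overestimate` (a
    hypothetical stand-in for an approximate Hessian). Stops once
    |f'(x)| <= eps and returns (x, number of steps taken).
    """
    x = x0
    for t in range(max_iter):
        g = grad(x)
        if abs(g) <= eps:
            return x, t
        x -= g / (overestimate * hess(x))
    return x, max_iter


if __name__ == "__main__":
    # Tightening eps by a fixed factor adds a roughly fixed number of steps.
    for eps in (1e-2, 1e-4, 1e-6, 1e-8):
        x, iters = approx_newton(0.0, eps)
        print(f"eps={eps:.0e}  iterations={iters}  x={x:.6f}")
```

Running the script prints the step count for each tolerance. In this toy setting the count should increase by a similar amount each time `eps` shrinks by a factor of 100, which is the signature of a linear (geometric) convergence rate.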

Abstract

There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question-answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well-studied in prior works, where the first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem. In contrast to…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Matrix Theory and Algorithms · Machine Learning and ELM

Methods: Softmax