On the linearity of large non-linear models: when and why the tangent kernel is constant

Chaoyue Liu; Libin Zhu; Mikhail Belkin

arXiv:2010.01092 · cs.LG · February 23, 2021 · 33 citations

TL;DR

This paper investigates why certain large neural networks become approximately linear in their parameters as their width grows, attributes the effect to how the Hessian norm scales with width, and shows that a constant tangent kernel is neither universal among wide networks nor essential for successful training.

Contribution

The paper introduces a Hessian scaling framework that explains tangent kernel constancy, and it identifies conditions under which wide neural networks do not transition to linearity.
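The link between a small Hessian and a constant tangent kernel can be made concrete with two standard Taylor-type bounds. The block below is a sketch under stated assumptions, not the paper's own statement: the symbols f, w_0, R, B(w_0, R) and H are notation introduced here for illustration.

```latex
% Assumptions: scalar-output model f(w), twice differentiable in w;
% B(w_0, R) is the ball of radius R around the initialization w_0;
% H(w) is the Hessian of f with respect to the parameters w.
\begin{align*}
\bigl| f(w) - f(w_0) - \nabla f(w_0)^{\top}(w - w_0) \bigr|
  &\le \tfrac{1}{2} \sup_{\xi \in B(w_0, R)} \| H(\xi) \| \, \| w - w_0 \|^2, \\
\| \nabla f(w) - \nabla f(w_0) \|
  &\le \sup_{\xi \in B(w_0, R)} \| H(\xi) \| \, \| w - w_0 \|,
\qquad w \in B(w_0, R).
\end{align*}
```

If the supremum of the Hessian norm over the ball tends to zero as the width grows while R stays fixed, both right-hand sides vanish. The model then approaches its linearization at w_0, and its gradient, and hence the tangent kernel built from inner products of gradients, approaches its value at initialization.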

Findings

01 Transition to linearity depends on Hessian norm scaling with width

02 Constant tangent kernel is not universal for all wide networks

03 Non-linear last layers prevent the transition to linearity

Abstract

The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not…
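The Hessian scaling described in the abstract can be checked numerically on a toy model. The following is a minimal sketch, not the authors' code: the two-layer tanh network with 1/sqrt(m) output scaling, the standard Gaussian initialization, the fixed unit-norm input, and the function name `hessian_spectral_norm` are all assumptions chosen for illustration. Because each hidden unit has its own parameters, the parameter Hessian is block diagonal, so its spectral norm is the largest spectral norm among the per-unit blocks.

```python
import numpy as np


def hessian_spectral_norm(m, d=3, seed=0):
    """Spectral norm of the parameter Hessian of the toy model
    f(W, v; x) = (1/sqrt(m)) * sum_i v_i * tanh(w_i . x)
    at a Gaussian initialization, for one fixed unit-norm input x.
    (Hypothetical helper; model and initialization are assumptions.)"""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))   # hidden-layer weights, one row per unit
    v = rng.standard_normal(m)        # output-layer weights
    x = np.ones(d) / np.sqrt(d)       # fixed input with unit norm

    z = W @ x
    t = np.tanh(z)
    s1 = 1.0 - t**2                   # tanh'(z)
    s2 = -2.0 * t * s1                # tanh''(z)

    a = v * s2 / np.sqrt(m)           # d2f/dw_i dw_i = a_i * x x^T
    b = s1 / np.sqrt(m)               # d2f/dw_i dv_i = b_i * x

    # One (d+1) x (d+1) block per hidden unit, in the variables (w_i, v_i).
    # Blocks for different units do not interact, and d2f/dv_i dv_i = 0.
    blocks = np.zeros((m, d + 1, d + 1))
    blocks[:, :d, :d] = a[:, None, None] * np.outer(x, x)
    blocks[:, :d, d] = b[:, None] * x
    blocks[:, d, :d] = b[:, None] * x

    # eigvalsh handles the stack of symmetric blocks in one call.
    eigenvalues = np.linalg.eigvalsh(blocks)
    return float(np.abs(eigenvalues).max())


widths = [16, 256, 4096]
norms = {m: hessian_spectral_norm(m) for m in widths}
for m in widths:
    # The rescaled column grows at most slowly with m if the norm
    # shrinks roughly like 1/sqrt(m), up to logarithmic factors.
    print(f"m={m:5d}  ||H||={norms[m]:.4f}  sqrt(m)*||H||={norms[m] * np.sqrt(m):.3f}")
```

In this toy setting the Hessian norm shrinks as the width grows, which is the behavior the abstract ties to the transition to linearity. The sketch evaluates the Hessian only at initialization and for a single input, so it illustrates the scaling rather than reproducing the paper's analysis.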

Videos

On the linearity of large non-linear models: when and why the tangent kernel is constant · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Model Reduction and Neural Networks