# Microscopic and collective signatures of feature learning in neural networks

**Authors:** Andrea Corti, Rosalba Pacelli, Pietro Rotondo, Marco Gherardi

arXiv: 2508.20989 · 2025-08-29

## TL;DR

Using a statistical-mechanics framework for Bayesian learning in one-hidden-layer neural networks, this paper identifies both collective (manifold geometry) and microscopic (weight-level) signatures of feature learning in the proportional limit, where network width and training-set size are both large.

## Contribution

The paper gives analytical computations, within a Bayesian framework, that quantify fingerprints of feature learning at two levels: microscopic changes in the weights and the geometry of class manifolds in feature space.

## Key findings

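- The distance between different class manifolds in feature space is a nonmonotonic function of temperature, interpreted as the equilibrium counterpart of a phenomenon observed under gradient descent (GD) dynamics
- The microscopic learnable parameters undergo a finite, data-dependent displacement relative to the infinite-width limit and develop correlations
- These feature learning signatures appear in a regime where the posterior predictive distribution is that of Gaussian process regression with a trivially rescaled prior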
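
To make the last point concrete, here is a minimal NumPy sketch of Gaussian process regression with a scalar-rescaled prior kernel. It is an illustration under stated assumptions, not the paper's computation: the RBF kernel, the scalar `q`, and the identification of temperature with observation-noise variance are all hypothetical choices made for the example. It shows one elementary sense in which a scalar rescaling is "trivial": for the predictive mean, rescaling the kernel by `q` at temperature `T` is equivalent to using the unscaled kernel at temperature `T / q`.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Stand-in kernel (assumption): the paper's kernel depends on its architecture."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predictive_mean(K_train, K_test_train, y, q, temperature):
    """GP posterior mean with prior kernel q*K and noise variance `temperature`.

    `q` is a hypothetical scalar rescaling of the prior; treating the
    temperature as the noise variance is also an assumption of this sketch.
    """
    n_train = K_train.shape[0]
    regularized = q * K_train + temperature * np.eye(n_train)
    return q * K_test_train @ np.linalg.solve(regularized, y)

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
X_test = rng.normal(size=(4, 5))
y = rng.normal(size=20)

K = rbf_kernel(X_train, X_train)
K_star = rbf_kernel(X_test, X_train)

q, T = 0.7, 0.1
mean_rescaled = gp_predictive_mean(K, K_star, y, q, T)
mean_equiv = gp_predictive_mean(K, K_star, y, 1.0, T / q)
```

The two means coincide because `q K* (q K + T I)^{-1} y = K* (K + (T/q) I)^{-1} y`. The predictive variance does scale with `q`, so the equivalence shown here concerns the mean only.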

## Abstract

Feature extraction - the ability to identify relevant properties of data - is a key factor underlying the success of deep learning. Yet, it has proved difficult to elucidate its nature within existing predictive theories, to the extent that there is no consensus on the very definition of feature learning. A promising hint in this direction comes from previous phenomenological observations of quasi-universal aspects in the training dynamics of neural networks, displayed by simple properties of feature geometry. We address this problem within a statistical-mechanics framework for Bayesian learning in one hidden layer neural networks with standard parameterization. Analytical computations in the proportional limit (when both the network width and the size of the training set are large) can quantify fingerprints of feature learning, both collective ones (related to manifold geometry) and microscopic ones (related to the weights). In particular, (i) the distance between different class manifolds in feature space is a nonmonotonic function of the temperature, which we interpret as the equilibrium counterpart of a phenomenon observed under gradient descent (GD) dynamics, and (ii) the microscopic learnable parameters in the network undergo a finite data-dependent displacement with respect to the infinite-width limit, and develop correlations. These results indicate that nontrivial feature learning is at play in a regime where the posterior predictive distribution is that of Gaussian process regression with a trivially rescaled prior.
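
As a rough illustration of the collective observable mentioned in (i), the sketch below measures a distance between two class manifolds in the hidden-layer feature space of a one-hidden-layer network. Everything specific here is an assumption made for the example: the `tanh` activation, the `1/sqrt(input_dim)` scaling, the synthetic two-class data, and the particular definition of distance (center separation divided by the mean manifold radius). The weights are drawn from a Gaussian prior, so the code shows only how such a quantity can be measured, not the temperature-dependent posterior analyzed in the paper.

```python
import numpy as np

def hidden_features(X, W, b):
    """Hidden-layer activations; tanh and the scaling are assumed choices."""
    return np.tanh(X @ W / np.sqrt(X.shape[1]) + b)

def manifold_distance(F_a, F_b):
    """Center-to-center distance divided by the mean manifold radius.

    This normalization is a hypothetical definition for illustration;
    the paper's exact observable may differ.
    """
    center_a, center_b = F_a.mean(axis=0), F_b.mean(axis=0)
    radius_a = np.sqrt(np.mean(np.sum((F_a - center_a) ** 2, axis=1)))
    radius_b = np.sqrt(np.mean(np.sum((F_b - center_b) ** 2, axis=1)))
    return np.linalg.norm(center_a - center_b) / (0.5 * (radius_a + radius_b))

# Synthetic two-class data: Gaussian clouds shifted along the first axis.
rng = np.random.default_rng(0)
input_dim, width, n_per_class = 10, 200, 50
shift = np.zeros(input_dim)
shift[0] = 2.0
X_a = rng.normal(size=(n_per_class, input_dim)) + shift
X_b = rng.normal(size=(n_per_class, input_dim)) - shift

# Weights sampled from a standard Gaussian prior (no training, no posterior).
W = rng.normal(size=(input_dim, width))
b = np.zeros(width)

F_a = hidden_features(X_a, W, b)
F_b = hidden_features(X_b, W, b)
distance = manifold_distance(F_a, F_b)
print(f"manifold distance at the prior: {distance:.3f}")
```

To probe a temperature dependence with this kind of measurement, one would replace the prior draw of `W` with samples from the posterior at each temperature and track `distance` across them.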

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20989/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20989/full.md

## References

56 references, with the full list in the complete paper: https://tomesphere.com/paper/2508.20989/full.md

---
Source: https://tomesphere.com/paper/2508.20989