NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks
Nandan Kumar Jha, Brandon Reagen

TL;DR
NerVE introduces a spectral analysis framework to understand how large language model feed-forward networks organize information, revealing the impact of nonlinearities, optimizer geometry, and architecture on latent space dynamics and generalization.
Contribution
This work presents NerVE, an eigenspectral framework that efficiently tracks FFN spectral dynamics and links them to model design choices and generalization across various architectures and training setups.
Findings
Spectral signatures correlate with model generalization.
Nonlinearities reinject variance across eigenmodes.
Optimizer geometry influences spectral dynamics.
Abstract
We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales, and diverse…
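The four metrics named in the abstract can be illustrated on a toy example. The sketch below is not the authors' implementation: it assumes the eigenspectrum is taken from the covariance of a token-by-feature activation matrix, and it uses the textbook definitions of spectral entropy, participation ratio, and Jensen-Shannon divergence. The abstract does not define Eigenvalue Early Enrichment, so `top_heaviness_proxy` is a hypothetical stand-in for "top-heaviness" and should not be read as the paper's formula. The toy data (a rank-4 signal passed through ReLU) and all function names are likewise assumptions made for illustration.

```python
import numpy as np


def eigenspectrum(acts):
    """Normalized covariance eigenvalues (descending) of an (n_tokens, d) matrix."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (acts.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)        # drop tiny negative round-off
    return eigvals / eigvals.sum()


def spectral_entropy(p, eps=1e-12):
    """Shannon entropy of the normalized spectrum (dispersion)."""
    return float(-np.sum(p * np.log(p + eps)))


def participation_ratio(p):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues): effective dimensionality."""
    return float(p.sum() ** 2 / np.sum(p ** 2))


def top_heaviness_proxy(p):
    """ASSUMED stand-in for EEE, not the paper's definition.

    Twice the mean gap between the cumulative spectrum and the uniform
    cumulative line: 0 for a flat spectrum, near 1 when one mode dominates.
    """
    d = p.size
    uniform = np.arange(1, d + 1) / d
    return float(2.0 * np.mean(np.cumsum(p) - uniform))


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two normalized spectra."""
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


# Toy setup (assumption): a rank-4 signal embedded in 64 dimensions plays the
# role of pre-activation features; ReLU plays the role of the FFN nonlinearity.
rng = np.random.default_rng(0)
latent = rng.standard_normal((2000, 4))
pre = latent @ rng.standard_normal((4, 64))
post = np.maximum(pre, 0.0)

p_pre, p_post = eigenspectrum(pre), eigenspectrum(post)
for name, p in (("pre", p_pre), ("post", p_post)):
    print(name,
          "SE=%.3f" % spectral_entropy(p),
          "PR=%.2f" % participation_ratio(p),
          "top-heaviness=%.3f" % top_heaviness_proxy(p))
print("JS(pre, post) = %.4f" % js_divergence(p_pre, p_post))
```

In this toy setting the pre-activation spectrum has at most four nonzero modes, while the post-ReLU spectrum spreads variance over additional modes, which is the kind of shift the abstract describes as variance reinjection. Working only with eigenvalues of a d-by-d covariance, rather than storing all activations, is one plausible way to keep such tracking lightweight.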
Peer Reviews
Decision: ICLR 2026 Poster
- Provides a systematic, spectral lens to study FFN dynamics, an often-overlooked but important component of transformer models.
- The four complementary eigenspectrum metrics are theoretically interpretable and capture distinct aspects of the representations.
- The paper covers normalization variants, activation types, etc., making the analysis framework more broadly applicable.
- If I understand correctly, the paper treats activations from different sequences as interchangeable. In that case, what aspects of the analysis are specific to LLMs or transformer architectures? From this perspective, it may be valuable to also examine other types of models with FFNs and compare their behavior to that of LLMs. Alternatively, extending the analysis to explicitly account for the sequential structure of tokens could yield further insights.
- The models analyzed in the paper are r
1. The paper's core claim, that FFNs function as spectral reshapers to re-awaken inactive dimensions, is a compelling and intuitive explanation for their role. It provides a strong conceptual model that moves beyond viewing FFNs as simple key-value memories.
2. The chosen suite of four metrics is a strength. While SE and PR are related, the addition of EEE (to distinguish between different types of flat spectra) and, crucially, JS Divergence (to quantify the nonlinearity's effect) provides a more
1. The paper makes claims about "LLMs" but bases its findings on very small models (70M-130M). These spectral dynamics are not guaranteed to hold at the 1B+ parameter scales where architectural optimization is most critical. The findings need to be validated on larger models.
2. The paper repeatedly shows that "healthy" spectra (high PR, low EEE) correlate with low validation loss but fails to prove causation. It's just as likely that a well-optimized model produces these spectra as a byproduct
- The topic of studying the effect of FFNs in LLMs is interesting and the idea of using the eigenspectrum dynamics to do so is well motivated.
- Several results have interesting insights such as in section 3.1, the FFN nonlinearity redistributes the variance across various eigenvalues.
- Studying several architectural choices such as layernorms and their positioning or spectral norm is interesting (although also see "weaknesses" about how thorough this is).
1. Overall, I think the paper identifies an interesting area but feels somewhat underdeveloped. Some particular areas where this comes across:
   a. There doesn't seem to be a consistent message across sections 3.2-3.5. Each of the subsections looks at a different architectural choice but the conclusions don't seem to tie together, e.g. how does the finding that RoPE prevents mid-to-deep spectral connect to the positioning of LN? For me the most interesting part of these sections was in section 3.4
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis
