TL;DR
SeeDNorm introduces a dynamic normalization method that adjusts scaling based on input norms, improving model performance in large language models and vision tasks without significant additional parameters.
Contribution
It proposes a novel self-rescaled normalization layer that preserves input norm information and adapts scaling dynamically, enhancing model capacity and robustness.
Findings
Outperforms RMSNorm and LayerNorm in various tasks
Maintains stability during training with minimal parameter increase
Improves zero-shot and distributional shift performance
Abstract
Normalization layers constitute an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling…
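The contrast the abstract draws between a static and an input-dependent scaling coefficient can be illustrated with a minimal sketch. The `rms_norm` function follows the standard RMSNorm definition. The `dynamic_norm` function is only an assumed form chosen for illustration (a tanh of a dot product with a hypothetical vector `beta`, scaled by a hypothetical `alpha` and added to `gamma`); the text above does not give the exact SeeDNorm formula, so this should not be read as the authors' implementation.

```python
import math

def rms(x, eps=1e-6):
    """Root mean square of a vector (eps guards against division by zero)."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def rms_norm(x, gamma):
    """RMSNorm: divide by the RMS, then apply a static per-dimension gain."""
    r = rms(x)
    return [g * v / r for g, v in zip(gamma, x)]

def dynamic_norm(x, gamma, alpha, beta):
    """Illustrative input-dependent rescaling (assumed form, not from the source).

    The gain is gamma + tanh(x . beta) * alpha, so it varies with the
    magnitude of x instead of being fixed once training ends.
    """
    s = math.tanh(sum(v * b for v, b in zip(x, beta)))
    r = rms(x)
    return [(g + s * a) * v / r for g, a, v in zip(gamma, alpha, x)]

x = [1.0, -2.0, 3.0, 0.5]
x_scaled = [10 * v for v in x]
gamma = [1.0] * 4
alpha = [0.5] * 4   # hypothetical parameter
beta = [0.1] * 4    # hypothetical parameter

# RMSNorm gives (almost) the same output when the input is rescaled,
# which is the sense in which it discards the input norm.
small = rms_norm(x, gamma)
large = rms_norm(x_scaled, gamma)

# The dynamic variant produces different outputs for the two inputs.
dyn_small = dynamic_norm(x, gamma, alpha, beta)
dyn_large = dynamic_norm(x_scaled, gamma, alpha, beta)
```

Comparing `small` with `large` and `dyn_small` with `dyn_large` shows the point: the static gain leaves the output unchanged under rescaling of the input, while the input-dependent gain lets the input magnitude influence the result.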
Peer Reviews
Decision: ICLR 2026 Poster
1. SeeDNorm offers improved preservation of input norm information and adaptability to shifts in input distribution. Unlike RMSNorm, which disregards input norms, SeeDNorm retains some of the "scale" information through a dynamic term. This scaling, being dependent on the input, allows it to better manage varying magnitudes or changes in the domain.
2. Experiments conducted across language tasks demonstrate that SeeDNorm accelerates convergence and produces better final results compared to the

1. For extremely large models or those with many layers of normalization, the benefits of SeeDNorm may be limited. The paper indicates that the improvements diminish for dense models compared to Mixture of Experts (MoE) models.
2. Although SeeDNorm shows consistent improvement, the gains are often subtle, typically just a few tenths of a percent in accuracy, particularly for dense models, as illustrated in Table 2.
3. While the paper broadly discusses "vision tasks," it does not assess tasks t
The paper is technically detailed and the motivation is clear: existing normalization layers such as RMSNorm provide stability but lose information about the input magnitude. SeeDNorm offers a straightforward extension that dynamically rescales activations conditioned on the current input, improving adaptability to data variability and distributional shifts. The theoretical analysis is comprehensive, covering forward and backward derivations as well as scaling invariance, and helps clarify why t
Although the results are strong, the contribution may appear incremental because SeeDNorm can be viewed as a combination of RMSNorm and DyT that merges their strengths while mitigating DyT’s vanishing gradient issue. The proposed dynamic rescaling mechanism resembles existing modulation approaches such as gating or FiLM-style conditioning, differing mainly in how the rescaling term is computed. The theoretical analysis mainly focuses on stability and gradient behavior but gives limited insight i
- I appreciate the large scope of the experiments: The paper presents extensive studies across tasks and scale, spanning LLM pretraining (with normal and MoE models), image classification, and image generation.
- SeeDNorm either matches or outperforms other normalization variants. The strongest benefits are observed for MoE language models
- The ablations are interesting and comprehensive. They answer several questions that came up when reading through the paper
- the paper is well-written and
- A few experimental details are missing, and it is unclear whether default or modified training scripts have been used (more below in questions)
- For vision experiments, especially image classification, the benefits of SeeDNorm are less clear. In Table 2, the numbers are either identical or extremely close to baselines (yet the SeeDNorm variant is bold - why?). Further, the default SeeDNorm variant fails to converge (Table 3), and Multihead SeeDNorm is required to match or bring gains over
