Elucidating the Role of Feature Normalization in IJEPA
Adam Colton

TL;DR
This paper investigates how feature normalization in IJEPA affects token energy hierarchy and proposes replacing it with DynTanh, leading to improved accuracy and artifact reduction in self-supervised visual learning.
Contribution
The paper identifies the disruptive effect of feature layer normalization in IJEPA and introduces DynTanh as a better alternative to preserve token energies.
Findings
Replacing LN with DynTanh improves ImageNet accuracy from 38% to 42.7%.
DynTanh reduces checkerboard artifacts in loss maps.
Preserving token energy hierarchy enhances self-supervised learning effectiveness.
Abstract
In the standard image joint embedding predictive architecture (IJEPA), features at the output of the teacher encoder are layer normalized (LN) before serving as a distillation target for the student encoder and predictor. We propose that this feature normalization disrupts the natural energy hierarchy of visual tokens, where high-energy tokens (those with larger L2 norms) encode semantically important image regions. LN forces all features to have identical L2 norms, effectively equalizing their energies and preventing the model from prioritizing semantically rich regions. We find that IJEPA models trained with feature LN exhibit loss maps with significant checkerboard-like artifacts. We propose replacing feature LN with a DynTanh activation, as the latter better preserves token energies and allows high-energy tokens to contribute more to the prediction loss. We show that IJEPA…
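The claim that LN equalizes token energies while DynTanh preserves them can be checked numerically. The sketch below is illustrative only and is not the paper's code: it assumes LN without a learned affine transform, and it assumes DynTanh is an elementwise tanh(alpha * x) with a fixed scalar alpha (the exact form and parameter values used in the paper are not given in this summary). The names `layer_norm`, `dyn_tanh`, and `token_norms`, and the toy features, are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Per-token LN over the feature axis, with no learned scale or shift (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dyn_tanh(x, alpha=0.5):
    # Assumed form of DynTanh: elementwise tanh with a scalar gain alpha.
    return np.tanh(alpha * x)

def token_norms(x):
    # L2 norm ("energy") of each token's feature vector.
    return np.linalg.norm(x, axis=-1)

rng = np.random.default_rng(0)
num_tokens, dim = 8, 64

# Toy teacher features: one shared direction scaled by increasing per-token
# energy, so the only difference between tokens is their L2 norm.
base = rng.standard_normal(dim)
scales = np.linspace(0.2, 2.0, num_tokens)[:, None]
features = scales * base

raw_norms = token_norms(features)
ln_norms = token_norms(layer_norm(features))
dt_norms = token_norms(dyn_tanh(features))

print("raw     :", np.round(raw_norms, 3))
print("LN      :", np.round(ln_norms, 3))
print("DynTanh :", np.round(dt_norms, 3))
```

Under these assumptions, every LN output has an L2 norm of roughly sqrt(dim), so the original energy ordering is erased, while the tanh outputs keep norms that still increase with the input energy (compressed, since tanh saturates). Real encoder features will not share a single direction, so this only illustrates the mechanism described in the abstract.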
