Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
Mang Li, Wei Lyu

TL;DR
This paper investigates why large-scale sparse feature models overfit quickly during training, provides a theoretical explanation using Rademacher complexity, and proposes an adaptive regularization method that improves performance and prevents overfitting.
Contribution
It offers a theoretical understanding of overfitting in sparse models and introduces an adaptive regularization technique that enhances training stability and performance.
Findings
Theoretical explanation of overfitting using Rademacher complexity.
Adaptive regularization prevents performance degradation in multi-epoch training.
Method improves model performance within a single epoch and is deployed in production.
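For readers unfamiliar with the tool named in the findings, the block below restates the standard textbook definition of empirical Rademacher complexity and the well-known bound for a norm-bounded linear class. It is background only, not the paper's own theorem. The symbols $\mathcal{F}$, $S$, $n$, $\sigma_i$, $B$, and $R$ are notation assumed here for illustration.

```latex
% Empirical Rademacher complexity of a function class F on a sample
% S = (x_1, ..., x_n), with sigma_i i.i.d. uniform on {-1, +1}:
\[
  \hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[
      \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)
    \right]
\]

% Standard bound for the linear class
% F_B = { x -> <w, x> : ||w||_2 <= B } when ||x_i||_2 <= R for all i:
\[
  \hat{\mathfrak{R}}_S(\mathcal{F}_B) \le \frac{B R}{\sqrt{n}}
\]
```

The bound grows with the norm budget $B$ and shrinks with the number of samples $n$, which is the general intuition behind controlling capacity by constraining parameter norms.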
Abstract
The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in search, advertising, and recommendation domains. These models, which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, the fundamental cause of this phenomenon remains unclear. In this work, we present a theoretical explanation grounded in Rademacher complexity, supported by empirical experiments, to explain why overfitting occurs in models with large-scale sparse categorical features. Based on this analysis, we propose a regularization method that constrains the norm budget of embedding layers adaptively. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves…
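The abstract describes the method only as adaptively constraining the norm budget of embedding layers, and the reviews mention tracking the last update step of each embedding row and approximating feature frequency by occurrence intervals. A minimal sketch of that general idea follows. The function name `adaptive_decay_step`, the parameters `base_lam` and `max_scale`, and the choice to scale an L2 coefficient linearly with the occurrence interval are all assumptions made for illustration; the paper's derived coefficients may differ.

```python
import numpy as np

def adaptive_decay_step(emb, last_step, ids, grads, step,
                        lr=0.1, base_lam=1e-3, max_scale=100.0):
    """One SGD step on the embedding rows in `ids` with interval-scaled L2 decay.

    Hypothetical rule: a row that has not been seen for many steps (a rare
    feature) gets a proportionally larger decay coefficient.
    """
    uniq, inverse = np.unique(ids, return_inverse=True)
    inverse = inverse.ravel()

    # Sum the gradients of duplicate ids so each row is updated once.
    g = np.zeros((len(uniq), emb.shape[1]))
    np.add.at(g, inverse, grads)

    # Occurrence interval: steps elapsed since the row was last updated.
    interval = np.clip(step - last_step[uniq], 1, max_scale)
    lam = base_lam * interval  # assumed linear scaling, not from the paper

    # Gradient step plus per-row L2 shrinkage.
    emb[uniq] -= lr * (g + lam[:, None] * emb[uniq])
    last_step[uniq] = step
    return lam
```

Only one integer per embedding row is stored, and only the rows present in the batch are touched, so the bookkeeping cost in this sketch is proportional to the batch size rather than to the size of the embedding table.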
Peer Reviews
Decision·ICLR 2026 Poster
1. Sound and well-motivated analysis linking embedding sparsity
1. The analysis assumes i.i.d. sampling and bounded feature norms, which may not hold in practice for long-tail industrial data.
2. The derivation of adaptive coefficients relies on approximating feature frequency by occurrence intervals; this estimation may introduce stochastic noise not formally analyzed.
3. The differentiability of $\phi(\tau_{ij})$ and the KKT-based derivation rest on idealized smoothness assumptions that may not hold in real deep networks.
4. The paper doesn’t rigorously stud
* Problem importance. The paper tackles an industry-recognized issue in large-scale CTR models, a core component of nearly all advertising systems.
* Theoretical grounding. It links the empirical “train only one epoch” practice to a capacity-control perspective, giving the phenomenon a clearer theoretical justification.
* Practical simplicity. The proposed technique is easy to integrate: it only requires tracking the last valid update step for each embedding row and can be directly plugged into e
1. The connection to existing “frequency-aware” or “epoch-level reset” approaches (e.g., MEDA-style methods) is not elaborated systematically.
2. The experiments mainly report global metrics such as AUC/Logloss, but lack fine-grained bucket analyses (by feature frequency, long-tail IDs, cold-start features) to directly demonstrate that “low-frequency rows were actually controlled.”
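The bucket analysis this review asks for could look roughly like the sketch below, which reports AUC separately for examples grouped by how often their feature ID appeared in training. The function names, the bucket edges `(10, 1000)`, and the single-feature setup are hypothetical choices for illustration; nothing here comes from the paper's experiments. The pairwise AUC is quadratic in the number of examples and is meant for small demonstrations only.

```python
import numpy as np

def auc(labels, scores):
    """Pairwise AUC: P(score_pos > score_neg), counting ties as 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined without both classes
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def auc_by_frequency_bucket(labels, scores, feature_ids, train_counts,
                            edges=(10, 1000)):
    """AUC per bucket of training frequency of each example's feature ID.

    With the default edges: bucket 0 is count < 10, bucket 1 is
    10 <= count < 1000, bucket 2 is count >= 1000.
    """
    freq = train_counts[feature_ids]
    bucket = np.digitize(freq, edges)
    return {int(b): auc(labels[bucket == b], scores[bucket == b])
            for b in np.unique(bucket)}
```

Comparing the low-frequency bucket across epochs, with and without regularization, would show directly whether rare rows are the ones being controlled, which a single global AUC cannot reveal.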
* Theoretical Insights: The paper offers a coherent theoretical framework using Rademacher complexity to explain one-epoch overfitting in sparse embeddings, clarifying a widely observed yet under-theorized phenomenon in deep CTR/CVR models (see Sections 2.2 and 3).
* Principled Adaptive Method: Instead of relying on heuristics, the proposed adaptive regularization derives directly from the theoretical analysis, offering an intuitive and practical approach that is compatible with common optimizers
* Limited Positioning Relative to Broader Regularization Literature: While the paper covers baseline regularization approaches (L1, L2, weight decay), it does not sufficiently engage with recent or foundational work on sparsity-driven regularization at both the architectural and optimization levels. For instance, no discussion is provided relating their adaptive method to pruning/growth strategies or thresholding-based approaches, even though such techniques are central to the sparsity literatur
Taxonomy
Topics: Recommender Systems and Techniques · Information Retrieval and Search Behavior · Text and Document Classification Technologies
