Adaptive Regularization for Large-Scale Sparse Feature Embedding Models

Mang Li; Wei Lyu

arXiv:2511.06374·cs.LG·January 28, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates why large-scale sparse feature models overfit quickly during training, provides a theoretical explanation using Rademacher complexity, and proposes an adaptive regularization method that improves performance and prevents overfitting.

Contribution

It offers a theoretical understanding of overfitting in sparse models and introduces an adaptive regularization technique that enhances training stability and performance.

Findings

01

Theoretical explanation of overfitting using Rademacher complexity.

02

Adaptive regularization prevents performance degradation in multi-epoch training.

03

Method improves model performance within a single epoch and is deployed in production.

Abstract

The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in the search, advertising, and recommendation domains. These models, which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, the fundamental cause of this phenomenon remains unclear. In this work, we present a theoretical explanation grounded in Rademacher complexity, supported by empirical experiments, of why overfitting occurs in models with large-scale sparse categorical features. Based on this analysis, we propose a regularization method that adaptively constrains the norm budget of embedding layers. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves…
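As background for the tool named in the abstract, the standard textbook definition of empirical Rademacher complexity and the generic generalization bound built on it are given below. This is general material, not the paper's specific result; the symbols \mathcal{G}, S, n, \sigma_i, and \delta are notation chosen here for illustration.

```latex
% Empirical Rademacher complexity of a function class G on a sample S = (z_1, ..., z_n)
\hat{\mathfrak{R}}_S(\mathcal{G})
  = \mathbb{E}_{\sigma}\left[ \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, g(z_i) \right],
  \qquad \sigma_i \in \{-1, +1\} \text{ i.i.d. uniform}

% Generic bound for g taking values in [0, 1]; holds for all g in G
% with probability at least 1 - \delta over the draw of S
\mathbb{E}[g(z)]
  \le \frac{1}{n} \sum_{i=1}^{n} g(z_i)
  + 2\,\mathfrak{R}_n(\mathcal{G})
  + \sqrt{\frac{\log(1/\delta)}{2n}},
  \qquad \mathfrak{R}_n(\mathcal{G}) = \mathbb{E}_S\big[\hat{\mathfrak{R}}_S(\mathcal{G})\big]
```

Bounding parameter norms is a standard way to shrink the complexity term, which is consistent with the abstract's description of a norm-budget constraint on embedding layers.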

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Sound and well-motivated analysis linking embedding sparsity

Weaknesses

1. The analysis assumes i.i.d. sampling and bounded feature norms, which may not hold in practice for long-tail industrial data.
2. The derivation of adaptive coefficients relies on approximating feature frequency by occurrence intervals; this estimation may introduce stochastic noise not formally analyzed.
3. The differentiability of \phi(\tau_{ij}) and the KKT-based derivation rest on idealized smoothness assumptions that may not hold in real deep networks.
4. The paper doesn’t rigorously stud

Reviewer 02 · Rating 8 · Confidence 4

Strengths

* Problem importance. The paper tackles an industry-recognized issue in large-scale CTR models, a core component of nearly all advertising systems.
* Theoretical grounding. It links the empirical “train only one epoch” practice to a capacity-control perspective, giving the phenomenon a clearer theoretical justification.
* Practical simplicity. The proposed technique is easy to integrate: it only requires tracking the last valid update step for each embedding row and can be directly plugged into e

Weaknesses

1. The connection to existing “frequency-aware” or “epoch-level reset” approaches (e.g., MEDA-style methods) is not elaborated systematically.
2. The experiments mainly report global metrics such as AUC/Logloss, but lack fine-grained bucket analyses (by feature frequency, long-tail IDs, cold-start features) to directly demonstrate that “low-frequency rows were actually controlled.”
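The row-level bookkeeping mentioned in this review (tracking the last update step for each embedding row) together with Reviewer 01's note about occurrence intervals can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: the class name `LazyAdaptiveDecay`, the linear rule `base_decay * interval`, the `max_decay` cap, and the direction of the scaling (longer gaps mean stronger shrinkage) are all assumptions made here for illustration.

```python
import numpy as np


class LazyAdaptiveDecay:
    """Hypothetical sketch: shrink an embedding row when it is touched,
    with a strength that grows with the gap since its last update."""

    def __init__(self, num_rows, dim, base_decay=1e-3, max_decay=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.1, size=(num_rows, dim))
        # Last step at which each row was updated (0 = never updated).
        self.last_step = np.zeros(num_rows, dtype=np.int64)
        self.base_decay = base_decay
        self.max_decay = max_decay

    def decay_strength(self, row_ids, step):
        # Occurrence interval as a rough proxy for inverse feature frequency
        # (assumed rule, capped so a row is never shrunk by more than max_decay).
        interval = step - self.last_step[row_ids]
        return np.minimum(self.base_decay * interval, self.max_decay)

    def update(self, row_ids, grads, step, lr=0.1):
        # Assumes row_ids are unique within a batch.
        row_ids = np.asarray(row_ids)
        strength = self.decay_strength(row_ids, step)
        self.table[row_ids] *= (1.0 - strength)[:, None]  # adaptive shrinkage
        self.table[row_ids] -= lr * grads                 # plain SGD step
        self.last_step[row_ids] = step


if __name__ == "__main__":
    emb = LazyAdaptiveDecay(num_rows=4, dim=2)
    emb.table[:] = 1.0
    emb.update([0], np.zeros((1, 2)), step=1)    # frequent row: short interval
    emb.update([1], np.zeros((1, 2)), step=100)  # rare row: long interval
    print(np.linalg.norm(emb.table, axis=1))
```

Only rows that appear in a batch are modified, so the extra cost is one integer per row plus a multiply on the rows already being updated. A per-frequency-bucket view of the resulting row norms would be one way to produce the fine-grained analysis this review asks for.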

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* Theoretical Insights: The paper offers a coherent theoretical framework using Rademacher complexity to explain one-epoch overfitting in sparse embeddings, clarifying a widely observed yet under-theorized phenomenon in deep CTR/CVR models (see Section 2.2 and 3).
* Principled Adaptive Method: Instead of relying on heuristics, the proposed adaptive regularization derives directly from the theoretical analysis, offering an intuitive and practical approach that is compatible with common optimizers

Weaknesses

* Limited Positioning Relative to Broader Regularization Literature: While the paper covers baseline regularization approaches (L1, L2, weight decay), it does not sufficiently engage with recent or foundational work on sparsity-driven regularization at both the architectural and optimization levels. For instance, no discussion is provided relating their adaptive method to pruning/growth strategies or thresholding-based approaches, even though such techniques are central to the sparsity literatur

Taxonomy

Topics: Recommender Systems and Techniques · Information Retrieval and Search Behavior · Text and Document Classification Technologies