Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits
Yan Li, Dhruv Choudhary, Xiaohan Wei, Baichuan Yuan, Bhargav Bhushanam, Tuo Zhao, Guanghui Lan

TL;DR
This paper introduces a frequency-aware SGD algorithm that leverages token frequency information to improve convergence speed in embedding learning, with theoretical guarantees and empirical validation on recommendation tasks.
Contribution
The paper proposes a frequency-dependent learning rate for SGD, provides the first provable improvements for non-convex embedding learning problems, and helps explain the success of adaptive methods.
Findings
Frequency-aware SGD achieves provable speed-up over standard SGD.
The proposed method matches or surpasses adaptive algorithms on benchmark tasks.
Token frequency information is implicitly exploited by existing adaptive algorithms.
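The third finding can be illustrated with a small sketch. It is not taken from the paper: it uses the standard Adagrad accumulator and assumes, purely for illustration, that every gradient a token receives has the same magnitude `grad_norm`. Under that assumption the accumulated squared gradient grows linearly with the token's occurrence count, so the effective step size shrinks like one over the square root of the count. The function name and the example counts are hypothetical.

```python
import math

def adagrad_effective_lr(counts, base_lr=0.1, grad_norm=1.0, eps=1e-10):
    """Effective Adagrad step size per token after counts[token] updates.

    Illustrative assumption: each update to a token contributes a gradient
    of identical magnitude grad_norm, and a token's embedding is only
    updated when the token appears in the sample.
    """
    rates = {}
    for token, n in counts.items():
        accum = n * grad_norm ** 2  # running sum of squared gradients
        rates[token] = base_lr / (math.sqrt(accum) + eps)
    return rates

# Hypothetical occurrence counts for one frequent and one rare token.
counts = {"frequent": 10_000, "rare": 100}
rates = adagrad_effective_lr(counts)

# The rare token ends up with a step size about sqrt(10_000 / 100) = 10x larger.
ratio = rates["rare"] / rates["frequent"]
```

In this simplified setting the adaptive rule never looks at frequencies explicitly, yet the resulting step sizes are ordered by them, which is the sense in which frequency information is exploited implicitly.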
Abstract
Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, which is largely attributed to their token-dependent learning rates. However, the mechanism that makes token-dependent learning rates efficient remains underexplored. We show that incorporating the frequency information of tokens in embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token…
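The idea of a counter-based, frequency-dependent learning rate can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the summary above does not give the exact step-size schedule, so the inverse-square-root scaling, the `max_lr` cap, and all names (`CounterFrequencyAwareSGD`, `base_lr`, `max_lr`) are assumptions made for the example. The optimizer keeps a running count per token, estimates each token's frequency as count divided by total steps, and gives rarer tokens a larger step.

```python
import math
from collections import defaultdict

class CounterFrequencyAwareSGD:
    """Toy per-token SGD whose step size depends on a running frequency estimate."""

    def __init__(self, dim, base_lr=0.1, max_lr=1.0):
        self.dim = dim
        self.base_lr = base_lr
        self.max_lr = max_lr            # cap so very rare tokens do not take huge steps
        self.counts = defaultdict(int)  # occurrences seen so far, per token
        self.steps = 0                  # total number of updates
        self.emb = defaultdict(lambda: [0.0] * dim)

    def learning_rate(self, token):
        # Assumed schedule: base_lr / sqrt(estimated frequency), clipped at max_lr.
        if self.steps == 0 or self.counts[token] == 0:
            return self.max_lr
        freq = self.counts[token] / self.steps
        return min(self.max_lr, self.base_lr / math.sqrt(freq))

    def step(self, token, grad):
        # Update the counters first, then move only the sampled token's embedding.
        self.steps += 1
        self.counts[token] += 1
        lr = self.learning_rate(token)
        vec = self.emb[token]
        for i in range(self.dim):
            vec[i] -= lr * grad[i]
        return lr

# Hypothetical stream: token "a" appears 9 times, token "b" once.
opt = CounterFrequencyAwareSGD(dim=2)
for tok in ["a"] * 9 + ["b"]:
    opt.step(tok, [1.0, 1.0])

lr_a = opt.learning_rate("a")  # frequency estimate 0.9
lr_b = opt.learning_rate("b")  # frequency estimate 0.1
```

After this stream the rare token "b" has a larger learning rate than the frequent token "a". The counters only need one integer per token, which is the practical appeal of a counter-based estimate over storing per-coordinate gradient statistics.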
Taxonomy
Topics: Machine Learning and ELM · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
Methods: Stochastic Gradient Descent
