Improving Language Modelling with Noise-contrastive estimation
Farhana Ferdousi Liza, Marek Grzes

TL;DR
This paper demonstrates that with proper hyperparameter tuning and a new learning rate schedule, noise-contrastive estimation can effectively scale neural language models to large vocabularies, outperforming existing methods.
Contribution
It introduces the 'search-then-converge' learning rate schedule for NCE and provides hyperparameter tuning guidelines for NCE-based neural language modelling.
Findings
NCE can outperform state-of-the-art models with proper tuning.
The 'search-then-converge' schedule improves NCE training stability.
Other hyperparameters, such as the dropout rate and the weight initialisation range, also have a marked effect on NCE performance.
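To illustrate the kind of schedule referred to in these findings, here is a minimal sketch of the classical search-then-converge decay, in which the learning rate stays close to its initial value for roughly tau steps (the "search" phase) and then decays approximately as 1/t (the "converge" phase). This assumes the classical form; the paper's own heuristic for applying the schedule to NCE is not reproduced here, and the names search_then_converge, base_lr and tau are illustrative only.

```python
def search_then_converge(step, base_lr=1.0, tau=10.0):
    """Classical search-then-converge decay (assumed form, not the paper's exact heuristic).

    For step << tau the rate is close to base_lr ("search");
    for step >> tau it behaves like base_lr * tau / step ("converge").
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if tau <= 0:
        raise ValueError("tau must be positive")
    return base_lr / (1.0 + step / tau)


if __name__ == "__main__":
    # Print the rate at a few steps to show the two phases.
    for step in (0, 1, 10, 100, 1000):
        print(step, search_then_converge(step, base_lr=1.0, tau=10.0))
```

A larger tau prolongs the near-constant phase, which is the knob a tuning heuristic would typically set.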
Abstract
Neural language models do not scale well when the vocabulary is large. Noise-contrastive estimation (NCE) is a sampling-based method that allows for fast learning with large vocabularies. Although NCE has shown promising performance in neural machine translation, it was considered to be an unsuccessful approach for language modelling. A sufficient investigation of the hyperparameters in the NCE-based neural language models was also missing. In this paper, we showed that NCE can be a successful approach in neural language modelling when the hyperparameters of a neural network are tuned appropriately. We introduced the 'search-then-converge' learning rate schedule for NCE and designed a heuristic that specifies how to use this schedule. The impact of the other important hyperparameters, such as the dropout rate and the weight initialisation range, was also demonstrated. We showed that…
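As a rough illustration of why NCE avoids a full softmax over the vocabulary, the sketch below computes the standard NCE objective for one target word and k sampled noise words. It assumes the model's unnormalised score is treated as a log-probability (self-normalisation) and that the noise probabilities q(w) are supplied by the caller; the function names, the choice of k and the noise distribution are assumptions for illustration, not details taken from the paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss(target_score, target_noise_prob, noise_scores, noise_probs):
    """NCE loss for one target word and k noise samples (illustrative sketch).

    Each word is classified as "data" or "noise" using the logit
    score(w) - log(k * q(w)), so only k + 1 scores are needed
    instead of a softmax over the whole vocabulary.
    """
    k = len(noise_scores)
    if k == 0 or k != len(noise_probs):
        raise ValueError("need k >= 1 noise samples with matching probabilities")
    # Log-probability that the true word is classified as data.
    pos = log_sigmoid(target_score - math.log(k * target_noise_prob))
    # Log-probability that each noise word is classified as noise.
    neg = sum(
        log_sigmoid(-(s - math.log(k * q)))
        for s, q in zip(noise_scores, noise_probs)
    )
    return -(pos + neg)


if __name__ == "__main__":
    # Hypothetical scores and noise probabilities for k = 2 noise samples.
    print(nce_loss(1.5, 0.01, [-0.5, -2.0], [0.05, 0.02]))
```

The cost per training example depends on k rather than on the vocabulary size, which is the source of the speed-up described in the abstract.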
Taxonomy
Methods: Dropout
