Stabilized training of joint energy-based models and their practical applications
Martin Sustek, Samik Sadhu, Lukas Burget, Hynek Hermansky, Jesus Villalba, Laureano Moro-Velazquez, Najim Dehak

TL;DR
This paper introduces a stabilized training method for joint energy-based models (JEM) that makes training less prone to divergence and enables practical applications such as speech generation and denoising across various data modalities.
Contribution
The authors propose ST-JEM, a stabilization technique for SGLD-based JEM training, and add a regularization term to improve decision certainty, broadening JEM's applicability.
Findings
Stabilized training enables JEM to be trained on speech data.
The approach improves the quality of generated speech.
JEM can be used for speech denoising and other applications.
Abstract
The recently proposed Joint Energy-based Model (JEM) interprets a discriminatively trained classifier as an energy model, which is also trained as a generative model describing the distribution of the input observations. JEM training relies on "positive examples" (i.e., examples from the training data set) as well as on "negative examples", which are samples from the modeled distribution generated by means of Stochastic Gradient Langevin Dynamics (SGLD). Unfortunately, SGLD often fails to deliver negative samples of sufficient quality during standard JEM training, which causes a very unbalanced contribution from the positive and negative examples when calculating gradients for JEM updates. As a consequence, standard JEM training is quite unstable, requiring careful tuning of hyper-parameters and frequent restarts when the training starts diverging. This…
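To make the abstract's terms concrete, here is a minimal sketch of how a classifier's logits can be read as an energy and how SGLD draws a "negative example" from it. It is an illustration under stated assumptions, not the authors' implementation: the energy E(x) = -logsumexp_y f(x)[y] follows the commonly used JEM formulation, the linear classifier f(x) = Wx + b is a toy stand-in for a neural network, and the names `energy`, `energy_grad`, `sgld_sample` and the default step size and noise scale are hypothetical choices. It does not include the stabilization proposed in the paper.

```python
import numpy as np

def energy(x, W, b):
    """Energy of x under a toy linear classifier f(x) = W x + b.

    Assumes the usual JEM reading: E(x) = -logsumexp_y f(x)[y].
    """
    logits = W @ x + b
    m = logits.max()  # subtract the max for numerical stability
    return -(m + np.log(np.exp(logits - m).sum()))

def energy_grad(x, W, b):
    """Gradient of E(x) with respect to x: -sum_y softmax(f(x))[y] * W[y]."""
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -(p @ W)

def sgld_sample(x0, W, b, n_steps=20, step_size=1.0, noise_std=0.01, rng=None):
    """Run SGLD from x0: gradient steps on the energy plus Gaussian noise.

    step_size and noise_std are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = noise_std * rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * energy_grad(x, W, b) + noise
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
    x_init = rng.standard_normal(4)            # chain start (random noise)
    x_neg = sgld_sample(x_init, W, b, rng=rng)  # a "negative example"
    print("energy before:", energy(x_init, W, b))
    print("energy after: ", energy(x_neg, W, b))
```

In this sketch the chain moves toward lower energy, i.e. toward inputs the model considers more likely. When such chains are too short or poorly tuned, the negative samples are of low quality, which is the failure mode the abstract describes.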
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications
