Scaling Matters in Deep Structured-Prediction Models
Aleksandr Shevchenko, Anton Osokin

TL;DR
This paper investigates the challenges of joint training in deep structured-prediction models, proposing scaling algorithms to improve training stability and effectiveness across multiple tasks.
Contribution
It introduces online and offline scaling algorithms to address normalization issues, enabling successful end-to-end training of deep energy-based models.
Findings
Scaling algorithms improve joint training stability
Algorithms outperform multistage training approaches
Effective across diverse tasks
Abstract
Deep structured-prediction energy-based models combine the expressive power of learned representations with the ability to embed knowledge about the task at hand into the system. A common way to learn the parameters of such models is a multistage procedure in which different combinations of components are trained at different stages. The joint end-to-end training of the whole system is then done as the last fine-tuning stage. This multistage approach is time-consuming and cumbersome, as it requires multiple runs until convergence and multiple rounds of hyperparameter tuning. From this point of view, it is beneficial to start the joint training procedure from the beginning. However, such approaches often unexpectedly fail and deliver results worse than the multistage ones. In this paper, we hypothesize that one reason for joint training of deep energy-based models to fail is the…
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Machine Learning and Data Classification
