TL;DR
This paper investigates surrogate gradient methods, particularly SPIGOT, for training latent structure models in language processing, providing a new perspective, algorithms, and empirical insights into their effectiveness and failure modes.
Contribution
It offers a principled motivation for SPIGOT and related estimators, introduces new algorithms, and compares their performance with existing methods.
Findings
SPIGOT and STE can be derived from a pulled-back objective perspective.
New algorithms in the same family outperform some existing estimators.
Empirical results reveal failure cases and practical insights for structured latent models.
Abstract
Latent structure models are a powerful tool for modeling language data: they can mitigate the error propagation and annotation bottleneck in pipeline systems, while simultaneously uncovering linguistic insights about the data. One challenge with end-to-end training of these models is the argmax operation, which has null gradient. In this paper, we focus on surrogate gradients, a popular strategy to deal with this problem. We explore latent structure learning through the angle of pulling back the downstream learning objective. In this paradigm, we discover a principled motivation for both the straight-through estimator (STE) as well as the recently-proposed SPIGOT - a variant of STE for structured models. Our perspective leads to new algorithms in the same family. We empirically compare the known and the novel pulled-back estimators against the popular alternatives, yielding new insight…
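To make the abstract's point about the null gradient of argmax concrete, here is a minimal NumPy sketch of two surrogate gradients for an unstructured (one-hot) latent variable. It is an illustration under stated assumptions, not the paper's implementation: the squared-error downstream loss, the helper names (`one_hot_argmax`, `project_simplex`, `ste_identity_grad`, `spigot_style_grad`), and the step size `eta` are all hypothetical. The SPIGOT-style update uses the commonly described form `z - proj(z - eta * grad_z)`, with the probability simplex standing in for the structured polytope.

```python
import numpy as np

def one_hot_argmax(s):
    """Forward pass: discrete choice as a one-hot vector (zero gradient a.e.)."""
    z = np.zeros_like(s)
    z[np.argmax(s)] = 1.0
    return z

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                      # sorted in decreasing order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the target total
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index that stays positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def loss_grad_z(z, target):
    """Gradient of the assumed toy loss 0.5 * ||z - target||^2 with respect to z."""
    return z - target

def ste_identity_grad(s, target):
    """Straight-through with identity: pretend dz/ds = I in the backward pass."""
    z = one_hot_argmax(s)
    return loss_grad_z(z, target)

def spigot_style_grad(s, target, eta=1.0):
    """SPIGOT-style surrogate: step z against the loss gradient, project back
    onto the simplex, and use the displacement as the gradient for s."""
    z = one_hot_argmax(s)
    g = loss_grad_z(z, target)
    return z - project_simplex(z - eta * g)

scores = np.array([1.0, 2.0, 0.5])   # argmax picks index 1
target = np.array([1.0, 0.0, 0.0])   # downstream loss prefers index 0

g_ste = ste_identity_grad(scores, target)
g_spigot = spigot_style_grad(scores, target, eta=0.5)
print("STE-identity surrogate gradient:", g_ste)
print("SPIGOT-style surrogate gradient:", g_spigot)
```

In this toy case both surrogates lower the score of the wrongly selected index and raise the score of the preferred one when used in a gradient-descent step on `scores`. They differ in scale, and the projection keeps the SPIGOT-style target inside the simplex.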
