Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification
Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

TL;DR
This paper investigates replacing the inner product in the softmax layer with kernel functions for contextual word classification, analyzing performance, gradient behavior, and disambiguation across language tasks.
Contribution
It introduces the use of kernel functions in the softmax layer for NLP tasks, exploring their effects on performance and model properties.
Findings
Performance varies significantly with different kernels
Kernel mixtures can enhance disambiguation
Gradient properties differ across kernel types
Abstract
Prominently used in support vector machines and logistic regressions, kernel functions (kernels) can implicitly map data points into high-dimensional spaces and make it easier to learn complex decision boundaries. In this work, by replacing the inner product function in the softmax layer, we explore the use of kernels for contextual word classification. To compare the individual kernels, experiments are conducted on standard language modeling and machine translation tasks. We observe a wide range of performance across different kernel settings. Extending the results, we look at the gradient properties, investigate various mixture strategies and examine the disambiguation abilities.
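The core idea in the abstract, swapping the inner product in the softmax layer for a kernel, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the choice of polynomial and RBF kernels, their hyperparameters (`degree`, `c`, `gamma`), and the simple weighted average used as a mixture. The paper's actual kernels and mixture strategies may differ.

```python
import numpy as np

def linear_kernel(h, W):
    """Standard softmax logits: inner product of context h (d,) with each row of W (V, d)."""
    return W @ h

def polynomial_kernel(h, W, degree=2, c=1.0):
    """Hypothetical polynomial kernel (w . h + c)^degree; hyperparameters are assumptions."""
    return (W @ h + c) ** degree

def rbf_kernel(h, W, gamma=0.5):
    """Hypothetical RBF kernel exp(-gamma * ||w - h||^2); gamma is an assumption."""
    sq_dist = np.sum((W - h) ** 2, axis=1)
    return np.exp(-gamma * sq_dist)

def kernel_softmax(h, W, kernel=linear_kernel, **kernel_args):
    """Softmax over the vocabulary using kernel values as logits."""
    logits = kernel(h, W, **kernel_args)
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def mixture_softmax(h, W, kernels, weights):
    """Illustrative mixture: weighted average of per-kernel word distributions."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the mixture sums to 1
    probs = [kernel_softmax(h, W, kernel=k) for k in kernels]
    return sum(w * p for w, p in zip(weights, probs))

# Toy example: vocabulary of 5 words, 4-dimensional embeddings.
rng = np.random.default_rng(0)
V, d = 5, 4
W = rng.normal(size=(V, d))   # output word embeddings
h = rng.normal(size=d)        # context vector from the encoder

p_lin = kernel_softmax(h, W)
p_poly = kernel_softmax(h, W, kernel=polynomial_kernel, degree=2, c=1.0)
p_rbf = kernel_softmax(h, W, kernel=rbf_kernel, gamma=0.5)
p_mix = mixture_softmax(h, W, [linear_kernel, rbf_kernel], [0.5, 0.5])
```

With `linear_kernel` the function reduces to the ordinary softmax layer, so the other kernels can be compared against it on the same `h` and `W`. Note that a bounded kernel such as the RBF above yields logits in (0, 1], which flattens the resulting distribution unless the values are rescaled; this is one way kernel choice can change both the output distribution and the gradients.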
Taxonomy
Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax
