Patterns versus Characters in Subword-aware Neural Language Modeling
Rustem Takhanov, Zhenisbek Assylbekov

TL;DR
This paper introduces pattern-based subword representations for neural language models, outperforming character-based models by capturing internal word structures more effectively.
Contribution
It proposes a novel pattern extraction method using CRFs with l1 regularization, improving word representations in language modeling tasks.
Findings
Pattern-based models outperform character-based models by 2-20 perplexity points.
Pattern embeddings match the performance of complex character-based architectures.
Using patterns enhances the representation of internal word structure.
Abstract
Words in some natural languages can have a composite structure. Elements of this structure include the root (which can itself be composite), prefixes, and suffixes, with which various nuances and relations to other words can be expressed. Thus, in order to build a proper word representation one must take into account its internal structure. From a corpus of texts we extract a set of frequent subwords, and from the latter set we select patterns, i.e. subwords which encapsulate information on character n-gram regularities. The selection is made using the pattern-based Conditional Random Field model with l1 regularization. Further, for every word we construct a new sequence over an alphabet of patterns. The new alphabet's symbols confine a local statistical context stronger than the characters, therefore they allow better representations in R^n and are better building blocks…
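The pipeline described in the abstract (extract frequent subwords, keep a subset as patterns, then describe each word through the patterns it contains) can be illustrated with a minimal sketch. This is not the authors' implementation: the l1-regularized CRF selection step is replaced here by a plain frequency threshold, summing pattern embeddings is assumed as the simplest way to compose a word vector, and the function names, `min_count`, `max_len`, the toy word list, and the random embeddings are all hypothetical.

```python
from collections import Counter
import random


def frequent_subwords(words, min_count=2, max_len=3):
    """Count every substring of length 1..max_len and keep the frequent ones.

    Assumption: a frequency threshold stands in for the paper's
    l1-regularized CRF pattern selection, which is not reproduced here.
    """
    counts = Counter()
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                counts[w[i:j]] += 1
    return {s for s, c in counts.items() if c >= min_count}


def patterns_in_word(word, patterns):
    """List pattern occurrences in the word, ordered by start position."""
    found = []
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            if word[i:j] in patterns:
                found.append(word[i:j])
    return found


def word_vector(word, patterns, embeddings, dim):
    """Represent a word as the sum of the embeddings of its patterns."""
    vec = [0.0] * dim
    for p in patterns_in_word(word, patterns):
        for k, x in enumerate(embeddings[p]):
            vec[k] += x
    return vec


# Toy corpus (hypothetical); a real setup would use a training corpus.
words = ["walked", "talked", "walking", "talking"]
patterns = frequent_subwords(words)

# Random vectors stand in for embeddings learned by the language model.
rng = random.Random(0)
dim = 4
embeddings = {p: [rng.uniform(-1, 1) for _ in range(dim)] for p in sorted(patterns)}

vec = word_vector("walked", patterns, embeddings, dim)
```

In this toy corpus, shared pieces such as "alk", "ed", and "ing" survive the threshold, so related word forms end up sharing pattern embeddings, which is the intuition behind using patterns rather than single characters as building blocks.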
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
