Redefining part-of-speech classes with distributional semantic models

Andrey Kutuzov; Erik Velldal; Lilja Øvrelid

arXiv:1608.03803·cs.CL·August 15, 2016

TL;DR

This study explores how word embeddings trained on the British National Corpus can reveal and redefine part-of-speech boundaries, uncovering annotation inconsistencies and supporting graded PoS classifications.

Contribution

The paper demonstrates that distributional semantic models carry rich PoS information, enough to single out groups of words whose distributional patterns differ from those of other words with the same tag.

Findings

01

PoS information is distributed across multiple vector components

02

Embeddings can identify hidden inconsistencies in PoS annotation

03

The results support the notion of graded or "soft" PoS affiliations
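
To make the first finding concrete, here is a minimal sketch of one way to check how many vector components carry a PoS signal. It is not the authors' procedure: the four-dimensional vectors are invented for illustration, and the function `component_gaps` and the 0.2 threshold are hypothetical choices. The score is simply the absolute difference between class means for each component.

```python
from statistics import mean

# Made-up 4-dimensional "embeddings"; these values are illustrative only
# and do not come from the British National Corpus or the paper.
NOUNS = [[0.9, 0.6, 0.5, 0.1], [0.8, 0.7, 0.5, 0.2], [0.7, 0.5, 0.5, 0.3]]
VERBS = [[0.3, 0.2, 0.5, 0.2], [0.2, 0.1, 0.5, 0.1], [0.1, 0.3, 0.5, 0.3]]


def component_gaps(class_a, class_b):
    """Return the absolute difference between class means, per component."""
    means_a = [mean(column) for column in zip(*class_a)]
    means_b = [mean(column) for column in zip(*class_b)]
    return [abs(a - b) for a, b in zip(means_a, means_b)]


gaps = component_gaps(NOUNS, VERBS)

# Hypothetical threshold: components whose class means differ by more than
# 0.2 are treated as carrying PoS information.
informative = [index for index, gap in enumerate(gaps) if gap > 0.2]
print(informative)
```

In this toy setting two of the four components separate the classes. With real embeddings the same kind of ranking could be run over every dimension to see whether the signal concentrates in a few components or spreads over many.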

Abstract

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words with distributional patterns that differ from other words of the same part of speech. This data often reveals hidden inconsistencies of the annotation process or guidelines. At the same time, it supports the notion of 'soft' or 'graded' part of speech affiliations. Finally, we show that information about PoS is distributed among dozens of vector components, not limited to only one or two features.
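
The idea in the abstract, predicting a word's PoS tag from its embedding and then looking at the words the classifier gets "wrong", can be sketched as follows. This is an illustration under stated assumptions, not the paper's setup: a nearest-centroid classifier with cosine similarity stands in for whatever classifier the authors trained, the three-dimensional vectors are invented, and the names `fit_centroids`, `predict` and `find_outliers` are hypothetical.

```python
import math
from collections import defaultdict

# Toy 3-dimensional "embeddings" with gold PoS tags. The values are made up
# for illustration and do not come from the BNC or the paper.
TOY_DATA = {
    "dog": ("NOUN", [0.9, 0.1, 0.0]),
    "house": ("NOUN", [0.8, 0.2, 0.1]),
    "idea": ("NOUN", [0.7, 0.1, 0.2]),
    "run": ("VERB", [0.1, 0.9, 0.0]),
    "eat": ("VERB", [0.2, 0.8, 0.1]),
    "think": ("VERB", [0.1, 0.7, 0.2]),
    # Tagged NOUN in this toy set, but its vector sits among the verbs.
    "swimming": ("NOUN", [0.2, 0.8, 0.0]),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def fit_centroids(data):
    """Average the vectors of each tag into one centroid per tag."""
    by_tag = defaultdict(list)
    for tag, vector in data.values():
        by_tag[tag].append(vector)
    return {
        tag: [sum(column) / len(vectors) for column in zip(*vectors)]
        for tag, vectors in by_tag.items()
    }


def predict(vector, centroids):
    """Return the tag whose centroid is most similar to the vector."""
    return max(centroids, key=lambda tag: cosine(vector, centroids[tag]))


def find_outliers(data):
    """List words whose predicted tag disagrees with their gold tag."""
    centroids = fit_centroids(data)
    return sorted(
        word
        for word, (tag, vector) in data.items()
        if predict(vector, centroids) != tag
    )


outliers = find_outliers(TOY_DATA)
print(outliers)
```

Words flagged this way are candidates for inspection: they may point to an annotation inconsistency, or to a word whose PoS affiliation is genuinely graded.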
