Are word boundaries useful for unsupervised language learning?

Tu Anh Nguyen; Maureen de Seyssel; Robin Algayres; Patricia Roze; Ewan Dunbar; Emmanuel Dupoux

arXiv:2210.02956·cs.CL·October 7, 2022·5 cites

Open Access

TL;DR

This study investigates whether word boundary information improves unsupervised language learning by comparing models with different input units and boundary information, showing that boundaries significantly enhance performance.

Contribution

The paper systematically evaluates the impact of boundary information in language models and demonstrates that unsupervised boundary detection can effectively substitute for gold boundaries.

Findings

01 Removing boundary information costs between 2% and 28% in performance.

02 Boundaries found by unsupervised detection also yield performance gains.

03 Boundary information matters for the linguistic knowledge probed at the lexical, syntactic and semantic levels.

Abstract

Word or word-fragment based Language Models (LM) are typically preferred over character-based ones in many downstream applications. This may not be surprising, as words seem to be more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic and semantic levels using three speech-adapted, black-box, psycholinguistically inspired NLP benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2% and 28%…
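To make the "with or without boundary information" contrast concrete, here is a minimal sketch of how one sentence could be turned into input units under three of the conditions named in the abstract. It is an illustration only, not the authors' preprocessing: the boundary symbol `<b>` and the helper names `char_units` and `word_units` are assumptions, and plain orthographic characters stand in for characters or phonemes.

```python
from typing import List

# Hypothetical boundary symbol; the paper's actual marker is not given here.
BOUNDARY = "<b>"


def char_units(sentence: str, keep_boundaries: bool) -> List[str]:
    """Split a sentence into character units.

    With keep_boundaries=True a boundary symbol is placed between words;
    with keep_boundaries=False the words are simply run together, which
    mimics a speech-like stream with no explicit word boundaries.
    """
    units: List[str] = []
    for i, word in enumerate(sentence.split()):
        if keep_boundaries and i > 0:
            units.append(BOUNDARY)
        units.extend(word)  # a string extends the list one character at a time
    return units


def word_units(sentence: str) -> List[str]:
    """Split a sentence into whitespace-delimited word units."""
    return sentence.split()


sentence = "the cat sat"
with_b = char_units(sentence, keep_boundaries=True)
without_b = char_units(sentence, keep_boundaries=False)
words = word_units(sentence)
```

The character sequence without boundaries carries the same symbols as the one with boundaries, minus the markers, so any difference in downstream performance between the two conditions can be attributed to the boundary information itself.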

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis