Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure

Stefan Gerdjikov; Klaus U. Schulz

arXiv:1602.05772 · cs.CL · February 19, 2016 · 1 citation

TL;DR

This paper presents an unsupervised, language-independent approach to identify phrases and subphrase structures in corpora, aiming to automatically build dictionaries and grammars without prior linguistic knowledge.

Contribution

It introduces novel corpus-based methods for unsupervised phrase and subphrase detection applicable across languages and data types, advancing automatic linguistic structure discovery.

Findings

01 Effective in multiple languages for phrase detection

02 Potential applications in text mining and lexicography

03 Foundation for automatic dictionary and grammar creation

Abstract

When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language, of any length, together with the underlying composition principles, in a completely corpus-based and language-independent way, without using any kind of prior linguistic knowledge. We describe unsupervised methods for identifying "phrases", mining subphrase structure, and finding words in a fully automated way. This can be considered a step towards automatically computing a "general dictionary and grammar of the corpus". We hope that, in the long run, variants of our approach will turn out to be useful for other kinds of sequence data as well, such as speech, genome sequences, or music annotation. Even if we are not primarily interested in immediate applications, results obtained for a variety of languages show that our methods are…

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling