Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure
Stefan Gerdjikov, Klaus U. Schulz

TL;DR
This paper presents an unsupervised, language-independent approach for identifying phrases and subphrase structure in corpora, with the aim of automatically building dictionaries and grammars without prior linguistic knowledge.
Contribution
It introduces novel corpus-based methods for unsupervised phrase and subphrase detection applicable across languages and data types, advancing automatic linguistic structure discovery.
Findings
Effective in multiple languages for phrase detection
Potential applications in text mining and lexicography
Foundation for automatic dictionary and grammar creation
Abstract
When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language of any length and underlying composition principles in a completely corpus-based and language-independent way without using any kind of prior linguistic knowledge. Unsupervised methods for identifying "phrases", mining subphrase structure and finding words in a fully automated way are described. This can be considered as a step towards automatically computing a "general dictionary and grammar of the corpus". We hope that in the long run variants of our approach turn out to be useful for other kinds of sequence data as well, such as, e.g., speech, genome sequences, or music annotation. Even if we are not primarily interested in immediate applications, results obtained for a variety of languages show that our methods are…
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling
