Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics
Preslav Nakov

TL;DR
This paper presents novel unsupervised and lightly supervised methods that use the Web as a corpus to analyze noun compound syntax and semantics, achieving state-of-the-art accuracy in noun compound bracketing and improving related NLP tasks such as prepositional phrase attachment and machine translation.
Contribution
It introduces new surface features and paraphrases for noun compound analysis using Web data, enhancing syntactic disambiguation and semantic understanding.
Findings
State-of-the-art accuracy in noun compound bracketing
Effective application of features to prepositional phrase attachment
Improved machine translation through paraphrasing techniques
Abstract
An important characteristic of written English text is the abundance of noun compounds - sequences of nouns acting as a single noun, e.g., colon cancer tumor suppressor protein. While eventually mastered by domain experts, their interpretation poses a major challenge for automated analysis. Understanding noun compounds' syntax and semantics is important for many natural language applications, including question answering, machine translation, information retrieval, and information extraction. I address the problem of noun compound syntax by means of novel, highly accurate unsupervised and lightly supervised algorithms using the Web as a corpus and search engines as interfaces to that corpus. Traditionally, the Web has been viewed as a source of page hit counts, used as an estimate for n-gram word frequencies. I extend this approach by introducing novel surface features and paraphrases,…
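To make the "page hit counts as n-gram frequency estimates" idea concrete, here is a minimal sketch of how such counts could drive the bracketing of a three-word noun compound. It uses the standard adjacency comparison (is the first pair or the second pair the more frequent bigram?), not the surface features or paraphrases the abstract introduces. The names `TOY_HITS`, `hits`, and `bracket_adjacency` are hypothetical, and the counts are invented placeholders standing in for search-engine results.

```python
from typing import Dict, Tuple

# Hypothetical toy counts standing in for search-engine page hits.
# These numbers are invented for illustration only.
TOY_HITS: Dict[Tuple[str, str], int] = {
    ("liver", "cell"): 900,
    ("cell", "antibody"): 100,
    ("cell", "line"): 5000,
}


def hits(w1: str, w2: str) -> int:
    """Return the (toy) hit count for the bigram "w1 w2"; 0 if unseen."""
    return TOY_HITS.get((w1, w2), 0)


def bracket_adjacency(w1: str, w2: str, w3: str) -> str:
    """Bracket a three-word noun compound by comparing adjacent bigrams.

    Returns "left" for [[w1 w2] w3] when "w1 w2" is at least as frequent
    as "w2 w3", and "right" for [w1 [w2 w3]] otherwise.
    """
    return "left" if hits(w1, w2) >= hits(w2, w3) else "right"


if __name__ == "__main__":
    print(bracket_adjacency("liver", "cell", "antibody"))
    print(bracket_adjacency("liver", "cell", "line"))
```

In a real setting, `hits` would query a search engine or an n-gram collection instead of a dictionary, and ties or zero counts would need a more careful fallback than the default used here.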
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
