Extraction of V-N-Collocations from Text Corpora: A Feasibility Study for German
Elisabeth Breidt (Seminar für Sprachwissenschaft, Universität Tübingen, Germany)

TL;DR
This study evaluates a statistical method for extracting verb-noun collocations from German texts, addressing language-specific challenges and optimizing precision and recall for different linguistic and computational goals.
Contribution
It adapts and assesses a statistical approach for German V-N collocation extraction, highlighting modifications to improve accuracy and discussing trade-offs between precision and recall.
Findings
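The most restrictive method reaches 97.8% precision (a 2.2% error rate) on a sufficiently large corpus of at least 6 million word tokens.
A less restrictive method still reaches 87.6% precision and extracts almost twice as much data, so it has higher recall at lower precision.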
Large corpora can compensate for lower recall in practical applications.
Abstract
The usefulness of a statistical approach suggested by Church et al. (1991) is evaluated for the extraction of verb-noun (V-N) collocations from German text corpora. Some problematic issues of that method arising from properties of the German language are discussed, and various modifications of the method are considered that might improve extraction results for German. The precision and recall of all variant methods are evaluated for V-N collocations containing support verbs, and the consequences for further work on the extraction of collocations from German corpora are discussed. With a sufficiently large corpus (at least 6 million word tokens), the average error rate of wrong extractions can be reduced to 2.2% (97.8% precision) with the most restrictive method, although at a loss in data of almost 50% compared to a less restrictive method that still achieves 87.6% precision. Depending on the goal to be…
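As a rough illustration of the kind of statistic involved, the sketch below scores verb-noun pairs with mutual information and a t-score, the two association measures commonly associated with Church et al. (1991). This summary does not say which measure, thresholds, or preprocessing the study actually uses, so the function name `association_scores`, the input format (a flat list of already extracted verb-noun pairs), and the formulas in their textbook form are all assumptions made for illustration.

```python
import math
from collections import Counter


def association_scores(pairs):
    """Score (verb, noun) pairs by mutual information and t-score.

    `pairs` is a hypothetical input: one (verb, noun) tuple per
    co-occurrence already extracted from a corpus. Both formulas are
    the common textbook versions, not necessarily the study's variant.
    """
    n = len(pairs)
    pair_counts = Counter(pairs)
    verb_counts = Counter(verb for verb, _ in pairs)
    noun_counts = Counter(noun for _, noun in pairs)

    scores = {}
    for (verb, noun), observed in pair_counts.items():
        # Co-occurrence count expected if verb and noun were independent.
        expected = verb_counts[verb] * noun_counts[noun] / n
        # Mutual information: log ratio of observed to expected count.
        mi = math.log2(observed / expected)
        # t-score: difference from expectation, scaled by sqrt(observed).
        t = (observed - expected) / math.sqrt(observed)
        scores[(verb, noun)] = (mi, t)
    return scores


# Toy data (invented for illustration, not taken from the study).
toy_pairs = (
    [("treffen", "Entscheidung")] * 3
    + [("treffen", "Freund")]
    + [("machen", "Fehler")] * 2
)
for pair, (mi, t) in sorted(association_scores(toy_pairs).items()):
    print(pair, round(mi, 2), round(t, 2))
```

A stricter cut-off on such scores keeps fewer candidate pairs, which is one way to picture the precision and recall trade-off the abstract describes.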
Taxonomy
Topics: Natural Language Processing Techniques · Lexicography and Language Studies · Linguistics and Terminology Studies
