Termhood-based Comparability Metrics of Comparable Corpus in Special Domain

Sa Liu; Chengzhi Zhang

arXiv:1302.4489·cs.CL·February 20, 2013

TL;DR

This paper introduces a novel termhood-based comparability metric for comparable corpora in special domains, improving bilingual terminology extraction by ranking words by termhood rather than frequency.

Contribution

The paper proposes a new comparability metric based on termhood, enhancing the accuracy of bilingual terminology extraction in specialized domains.

Findings

01

Termhood-based metrics outperform frequency-based metrics.

02

Cosine similarity of termhood rankings effectively measures corpus comparability.

03

Improved corpus comparability aids multilingual information processing.
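
The second finding can be illustrated with a minimal sketch. It is not the authors' implementation: the summary above does not say how termhood is computed or how terms are aligned across languages. The sketch therefore assumes termhood scores are already available as term-to-score mappings, and that a hypothetical bilingual dictionary (`translate`) maps source-language terms into the target vocabulary. The function names and the cutoff `k` are likewise illustrative.

```python
import math


def cosine_similarity(scores_a, scores_b):
    """Cosine similarity between two sparse term -> score mappings."""
    shared = set(scores_a) & set(scores_b)
    dot = sum(scores_a[t] * scores_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in scores_a.values()))
    norm_b = math.sqrt(sum(v * v for v in scores_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as not comparable
    return dot / (norm_a * norm_b)


def top_terms(scores, k):
    """Keep the k highest-scoring terms (ranking by termhood, not frequency)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def comparability(scores_src, scores_tgt, translate, k=100):
    """Assumed pipeline: rank, map source terms via `translate`, then compare."""
    mapped = {}
    for term, score in top_terms(scores_src, k).items():
        target = translate.get(term)  # hypothetical bilingual dictionary lookup
        if target is not None:
            mapped[target] = mapped.get(target, 0.0) + score
    return cosine_similarity(mapped, top_terms(scores_tgt, k))
```

Under these assumptions, two subcorpora whose top-ranked terms translate to each other with similar termhood scores yield a value near 1, while subcorpora with no translatable overlap yield 0.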

Abstract

Resources for cross-language information retrieval (CLIR) and machine translation (MT), such as dictionaries and parallel corpora, are scarce and hard to obtain for special domains. Moreover, these resources are limited to a few languages, such as English, French, and Spanish. Automatically obtaining comparable corpora for such domains could therefore be an effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be easily obtained from the web. Building and using comparable corpora is therefore often a more feasible option in multilingual information processing. Comparability metrics are one of the key issues in building and using comparable corpora. Currently, there is no widely accepted definition of corpus comparability or method for measuring it. In fact, different definitions or metrics of comparability might be…
