Termhood-based Comparability Metrics of Comparable Corpus in Special Domain
Sa Liu, Chengzhi Zhang

TL;DR
This paper introduces a novel termhood-based comparability metric for comparable corpora in special domains, improving bilingual terminology extraction by ranking words by termhood rather than frequency.
Contribution
The paper proposes a new comparability metric based on termhood, enhancing the accuracy of bilingual terminology extraction in specialized domains.
Findings
Termhood-based metrics outperform frequency-based metrics.
Cosine similarity of termhood rankings effectively measures corpus comparability.
Improved corpus comparability aids multilingual information processing.
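The second finding can be illustrated with a minimal sketch. It assumes termhood scores have already been computed for each subcorpus, since this summary does not say which termhood measure the paper uses. The function names, the `dictionary` mapping, the top-`k` cutoff, and the way translations are projected are all hypothetical choices made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(term, 0.0) for term, score in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_terms(scores, k):
    """Keep the k highest-scoring terms (ranking by termhood, not frequency)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def comparability(src_scores, tgt_scores, dictionary, k=100):
    """Hypothetical comparability score between two subcorpora.

    src_scores / tgt_scores: term -> termhood score (assumed precomputed).
    dictionary: source term -> list of target-language translations (assumed).
    """
    src_top = top_terms(src_scores, k)
    tgt_top = top_terms(tgt_scores, k)

    # Project source terms into the target vocabulary via the dictionary.
    # If several source terms share a translation, keep the highest score.
    projected = {}
    for term, score in src_top.items():
        for translation in dictionary.get(term, []):
            projected[translation] = max(projected.get(translation, 0.0), score)

    return cosine(projected, tgt_top)
```

Under these assumptions, two subcorpora whose top-ranked terms translate into each other with similar termhood scores get a value near 1, while subcorpora with no shared translated terms get 0.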
Abstract
Resources for cross-language information retrieval (CLIR) and machine translation (MT), such as dictionaries and parallel corpora, are scarce and hard to come by for special domains. Moreover, these resources are limited to a few languages, such as English, French, and Spanish. Automatically obtaining comparable corpora for such domains could therefore be an effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be easily obtained from the web. Building and using comparable corpora is thus often a more feasible option in multilingual information processing. Comparability metrics are one of the key issues in building and using comparable corpora. Currently, there is no widely accepted definition of corpus comparability or method for measuring it. In fact, different definitions or metrics of comparability might be…
