Lost in Evaluation: Misleading Benchmarks for Bilingual Dictionary Induction

Yova Kementchedjhieva; Mareike Hartmann; Anders Søgaard

arXiv:1909.05708·cs.CL·September 19, 2019·1 cites

TL;DR

This paper critically examines the quality of a widely used bilingual dictionary induction benchmark, revealing significant issues that can mislead system evaluations and suggesting more rigorous evaluation practices.

Contribution

The paper uncovers major flaws in the dataset used for BDI evaluation and demonstrates how these flaws distort the perceived performance differences between systems.

Findings

01 About a quarter of the test data consists of proper nouns, which are hardly indicative of BDI performance.

02 Gaps in the gold-standard targets inflate performance differences between systems.

03 With proper nouns removed, the margin between the top two systems in the study grows from 3.4% to 17.2%.

Abstract

The task of bilingual dictionary induction (BDI) is commonly used for intrinsic evaluation of cross-lingual word embeddings. The largest dataset for BDI was generated automatically, so its quality is dubious. We study the composition and quality of the test sets for five diverse languages from this dataset, with concerning findings: (1) a quarter of the data consists of proper nouns, which can hardly be indicative of BDI performance, and (2) there are pervasive gaps in the gold-standard targets. These issues appear to affect the ranking between cross-lingual embedding systems on individual languages, and the overall degree to which the systems differ in performance. With proper nouns removed from the data, the margin between the top two systems included in the study grows from 3.4% to 17.2%. Manual verification of the predictions, on the other hand, reveals that gaps in the gold…
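To make the proper-noun effect concrete, here is a minimal sketch of how the margin between two systems can change once proper nouns are filtered out of a BDI test set. It is not the authors' evaluation code: the function names, the toy English-Spanish entries, and the hand-picked proper-noun list are all assumptions invented for illustration, and the metric is assumed to be precision at 1 over source words.

```python
from typing import Dict, Set


def precision_at_1(predictions: Dict[str, str], gold: Dict[str, Set[str]]) -> float:
    """Fraction of source words whose top prediction is among the gold targets."""
    if not gold:
        return 0.0
    hits = sum(1 for src, targets in gold.items() if predictions.get(src) in targets)
    return hits / len(gold)


def drop_proper_nouns(gold: Dict[str, Set[str]], proper_nouns: Set[str]) -> Dict[str, Set[str]]:
    """Return a copy of the test set without source words flagged as proper nouns."""
    return {src: targets for src, targets in gold.items() if src not in proper_nouns}


# Toy test set (hypothetical, not taken from the benchmark).
gold = {
    "dog": {"perro"},
    "house": {"casa"},
    "paris": {"paris"},
    "obama": {"obama"},
}
proper_nouns = {"paris", "obama"}  # hypothetical manual annotation

# Both systems get the proper nouns right, since these are often near-identical
# across languages; they differ only on the common nouns.
system_a = {"dog": "perro", "house": "casa", "paris": "paris", "obama": "obama"}
system_b = {"dog": "gato", "house": "casa", "paris": "paris", "obama": "obama"}

full_margin = precision_at_1(system_a, gold) - precision_at_1(system_b, gold)

filtered = drop_proper_nouns(gold, proper_nouns)
filtered_margin = precision_at_1(system_a, filtered) - precision_at_1(system_b, filtered)

print(f"margin on full test set:      {full_margin:.2f}")
print(f"margin without proper nouns:  {filtered_margin:.2f}")
```

In this toy setup the proper nouns are easy for both systems, so they dilute the difference between them; removing them widens the margin, which is the same direction of effect the abstract reports.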

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification