Lost in Evaluation: Misleading Benchmarks for Bilingual Dictionary Induction
Yova Kementchedjhieva, Mareike Hartmann, Anders Søgaard

TL;DR
This paper critically examines the quality of a widely used bilingual dictionary induction benchmark, revealing significant issues that can mislead system evaluations and suggesting more rigorous evaluation practices.
Contribution
It uncovers major flaws in the dataset used for BDI evaluation, demonstrating how these issues impact the perceived performance differences of systems.
Findings
A quarter of the dataset consists of proper nouns, which are not indicative of BDI performance.
Gaps in the gold-standard targets inflate performance differences between systems.
Removing proper nouns widens the margin between the top two systems in the study from 3.4% to 17.2%, showing how strongly the dataset's composition shapes the results.
Abstract
The task of bilingual dictionary induction (BDI) is commonly used for intrinsic evaluation of cross-lingual word embeddings. The largest dataset for BDI was generated automatically, so its quality is dubious. We study the composition and quality of the test sets for five diverse languages from this dataset, with concerning findings: (1) a quarter of the data consists of proper nouns, which can hardly be indicative of BDI performance, and (2) there are pervasive gaps in the gold-standard targets. These issues appear to affect the ranking between cross-lingual embedding systems on individual languages, and the overall degree to which the systems differ in performance. With proper nouns removed from the data, the margin between the top two systems included in the study grows from 3.4% to 17.2%. Manual verification of the predictions, on the other hand, reveals that gaps in the gold…
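To make the proper-noun effect concrete, here is a minimal sketch of how a margin between two systems can change once proper nouns are filtered out of a BDI test set. It assumes precision at 1 as the metric and a test set that maps each source word to a set of acceptable gold targets. The function names, the toy vocabulary, and both systems' predictions are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Dict, Set


def precision_at_1(predictions: Dict[str, str], gold: Dict[str, Set[str]]) -> float:
    """Fraction of source words whose top prediction is one of the gold targets."""
    if not gold:
        return 0.0
    hits = sum(1 for src, targets in gold.items() if predictions.get(src) in targets)
    return hits / len(gold)


def drop_proper_nouns(gold: Dict[str, Set[str]], proper_nouns: Set[str]) -> Dict[str, Set[str]]:
    """Return a copy of the test set without source words flagged as proper nouns."""
    return {src: targets for src, targets in gold.items() if src not in proper_nouns}


# Hypothetical toy test set: half of the entries are proper nouns,
# which both systems translate correctly by near-identity.
gold = {
    "dog": {"perro"},
    "house": {"casa", "hogar"},
    "paris": {"parís"},
    "london": {"londres"},
}
proper_nouns = {"paris", "london"}

# Hypothetical top-1 predictions from two systems.
system_a = {"dog": "perro", "house": "hogar", "paris": "parís", "london": "londres"}
system_b = {"dog": "gato", "house": "casa", "paris": "parís", "london": "londres"}

full_margin = precision_at_1(system_a, gold) - precision_at_1(system_b, gold)

filtered_gold = drop_proper_nouns(gold, proper_nouns)
filtered_margin = precision_at_1(system_a, filtered_gold) - precision_at_1(system_b, filtered_gold)

print(f"margin on full test set:     {full_margin:.2f}")
print(f"margin without proper nouns: {filtered_margin:.2f}")
```

In this toy setup the proper nouns are easy for both systems, so they dilute the difference on ordinary vocabulary; removing them doubles the margin. The same scoring function also shows why gaps in the gold targets matter: a correct translation that is missing from a word's gold set is counted as an error.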
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
