TL;DR
This paper investigates how structural design choices in probing tasks, such as the size of the annotated dataset and the type of classifier, affect the evaluation of sentence embeddings. It focuses on low-resource languages, shows that probing results vary with these choices, and argues for multilingual evaluation.
Contribution
It provides the first large-scale study on the sensitivity of probing results to design choices and emphasizes the importance of multilingual evaluation for sentence embeddings.
Findings
Design choices like dataset size and classifier type significantly influence probing outcomes.
Probing results for English do not necessarily transfer to other languages.
Multilingual probing evaluations should be conducted for fairer assessment.
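The sensitivity described in the findings above can be illustrated with a small sketch. It is not the paper's setup: the data are synthetic vectors standing in for sentence embeddings, and the two probes (a nearest-centroid classifier and a logistic regression trained by gradient descent) as well as all function names and training sizes are assumptions chosen only for illustration. The sketch trains each probe on several training-set sizes and reports accuracy on a fixed held-out set, so the effect of both design choices can be compared side by side.

```python
import numpy as np


def make_synthetic_probe_data(n, dim=16, seed=0):
    """Synthetic stand-in for (embedding, label) pairs; not real sentence embeddings."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim))
    X[:, 0] += 1.5 * (2 * y - 1)  # a single dimension encodes the binary label
    return X, y


def centroid_probe(X_tr, y_tr, X_te):
    """Nearest-centroid probe: assign each test vector to the closer class mean."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    return (d1 < d0).astype(int)


def logreg_probe(X_tr, y_tr, X_te, lr=0.1, steps=300):
    """Logistic-regression probe trained with plain batch gradient descent."""
    w = np.zeros(X_tr.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X_tr @ w + b)))
        grad = p - y_tr
        w -= lr * (X_tr.T @ grad) / len(y_tr)
        b -= lr * grad.mean()
    return ((X_te @ w + b) > 0).astype(int)


def run_grid(train_sizes=(20, 200, 2000), seed=0):
    """Accuracy for every (probe, training size) pair on one fixed test set."""
    n_test = 1000
    X, y = make_synthetic_probe_data(max(train_sizes) + n_test, seed=seed)
    X_te, y_te = X[-n_test:], y[-n_test:]
    probes = {"centroid": centroid_probe, "logreg": logreg_probe}
    results = {}
    for n in train_sizes:
        X_tr, y_tr = X[:n], y[:n]
        for name, probe in probes.items():
            preds = probe(X_tr, y_tr, X_te)
            results[(name, n)] = float((preds == y_te).mean())
    return results


if __name__ == "__main__":
    for (name, n), acc in sorted(run_grid().items()):
        print(f"{name:9s} n_train={n:5d} accuracy={acc:.3f}")
```

Because the test set is held fixed, any differences in the printed accuracies come only from the probe type and the amount of training data, which are the two design choices the findings single out.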
Abstract
Sentence encoders map sentences to real-valued vectors for use in downstream applications. To peek into these representations - e.g., to increase interpretability of their results - probing tasks have been designed which query them for linguistic knowledge. However, designing probing tasks for lesser-resourced languages is tricky, because these often lack the large-scale annotated data or (high-quality) dependency parsers that probing task design in English relies on. To investigate how to probe sentence embeddings in such cases, we study the sensitivity of probing task results to structural design choices, conducting the first such large-scale study. We show that design choices like the size of the annotated probing dataset and the type of classifier used for evaluation do (sometimes substantially) influence probing outcomes. We then probe embeddings in a multilingual setup with design…
Taxonomy
Methods: Interpretability
