# Pitfalls in the Evaluation of Sentence Embeddings

**Authors:** Steffen Eger, Andreas Rücklé, Iryna Gurevych

arXiv: 1906.01575 · 2019-06-05

## TL;DR

This paper identifies key pitfalls in evaluating sentence embeddings: comparing embeddings of different sizes, the effect of normalizing embeddings, and low, diverging correlations between transfer and probing tasks. It also recommends practices for more reliable future evaluations.

## Contribution

The paper compiles evaluation pitfalls for sentence embeddings into an easy-to-access reference and recommends better practices to make future evaluations more reliable.

## Key findings

- Comparison of embeddings of different sizes can be misleading
- Normalization significantly impacts evaluation results
- Transfer and probing task correlations are often low and inconsistent
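Two of the findings above can be made concrete with a minimal sketch. It is not the authors' code: the helper names `standardize`, `ranks`, and `spearman` are hypothetical, per-dimension z-normalization is assumed as one common form of embedding normalization, and Spearman rank correlation is assumed as one way to compare how transfer and probing tasks rank a set of encoders.

```python
from statistics import mean, pstdev


def standardize(embeddings):
    """Z-normalize each dimension across a set of embeddings.

    `embeddings` is a list of equal-length lists of floats (hypothetical input
    format). Each dimension ends up with mean 0 and unit population variance.
    """
    columns = list(zip(*embeddings))
    means = [mean(col) for col in columns]
    # A constant dimension has zero deviation; divide by 1.0 so it maps to 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in embeddings]


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks.

    Assumes both inputs have the same length and are not constant
    (a constant input would give zero variance and a division by zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In such a setup, one would pass every encoder's embeddings through `standardize` before training the downstream classifier, and call `spearman` on per-encoder scores from a transfer task and a probing task to see whether the two tasks rank the encoders similarly.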

## Abstract

Deep learning models continuously break new records across different NLP tasks. At the same time, their success exposes weaknesses of model evaluation. Here, we compile several key pitfalls of evaluation of sentence embeddings, a currently very popular NLP paradigm. These pitfalls include the comparison of embeddings of different sizes, normalization of embeddings, and the low (and diverging) correlations between transfer and probing tasks. Our motivation is to challenge the current evaluation of sentence embeddings and to provide an easy-to-access reference for future research. Based on our insights, we also recommend better practices for better future evaluations of sentence embeddings.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.01575/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1906.01575/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/1906.01575/full.md

---
Source: https://tomesphere.com/paper/1906.01575