Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation
Petra Barančíková, Ondřej Bojar

TL;DR
This paper compares Czech-specific and multilingual sentence embeddings using intrinsic semantic tests and an extrinsic machine translation (MT) evaluation task, revealing a disconnect between performance on semantic similarity tests and performance on MT evaluation.
Contribution
It provides a comprehensive analysis of how intrinsic semantic evaluations relate to the downstream MT evaluation task, and highlights how hard it is to operationalize semantics in sentence embeddings.
Findings
Models that excel in intrinsic semantic similarity tests do not consistently perform better on the downstream MT evaluation task.
Models with seemingly over-smoothed embedding spaces can achieve excellent MT evaluation results after fine-tuning.
Intrinsic semantic metrics may not reliably predict downstream performance on MT evaluation.
Abstract
In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the…
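The intrinsic STS evaluation mentioned in the abstract is commonly scored by correlating embedding similarity with human similarity ratings. The sketch below illustrates that general protocol only. Cosine similarity, Spearman correlation, and every name and number in it (`sts_score`, `toy_pairs`, `toy_gold`) are assumptions for illustration and are not taken from the paper, whose exact setup is not given in this listing.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_score(embedding_pairs, gold_scores):
    """Correlate cosine similarities of sentence pairs with gold ratings."""
    predicted = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(predicted, gold_scores)


# Hypothetical toy data: three sentence pairs as 2-d "embeddings"
# with made-up human similarity ratings on a 0-5 scale.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
toy_gold = [5.0, 3.0, 0.5]

print(sts_score(toy_pairs, toy_gold))
```

In a real run the vectors would come from the sentence embedding model under test, and the gold ratings from an STS benchmark. A high score here says only that the embedding geometry tracks human similarity judgments, which, per the abstract, does not guarantee better results on the downstream MT evaluation task.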
Taxonomy
Topics: Discourse Analysis and Cultural Communication
