Form and Meaning in Intrinsic Multilingual Evaluations

Wessel Poelman; Miryam de Lhoneux

arXiv:2601.10580 · cs.CL · January 16, 2026

TL;DR

This paper critically examines intrinsic evaluation metrics for multilingual language models, revealing their limitations and the importance of considering the form-meaning relationship in assessing model quality.

Contribution

The paper makes explicit the assumptions behind intrinsic multilingual evaluation metrics and tests their implications experimentally, showing that the metrics are not universally comparable.

Findings

01. Current metrics are not universally comparable across models and languages.

02. Assumptions about semantic equivalence in parallel sentences are often invalid.

03. The form-meaning relationship impacts the validity of intrinsic evaluations.

Abstract

Intrinsic evaluation metrics for conditional language models (CLMs), such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality, since the information content (here understood as the semantic meaning) is the same. However, the metrics inherently measure information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora, with both mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning…
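To make the two metrics named in the abstract concrete, here is a minimal sketch of how perplexity and bits-per-character are commonly computed from per-token log-probabilities. It is an illustration only, not the authors' implementation: the function names, the example strings, and all probabilities are hypothetical, and the sketch assumes natural-log probabilities and counts characters as Unicode code points.

```python
import math


def perplexity(token_logprobs):
    """Exponential of the mean negative log-probability per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_character(token_logprobs, text):
    """Total surprisal in bits, normalized by the number of characters in the text."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)


# Hypothetical numbers, not results from the paper: two stand-ins for parallel
# sentences to which a model assigns the same total surprisal (8 bits), but
# which differ in token count and character length.
sentence_a = "abcdefgh"            # 8 characters, assumed to be 4 tokens
sentence_b = "abcdefghijklmnop"    # 16 characters, assumed to be 8 tokens
logprobs_a = [math.log(0.25)] * 4  # 4 tokens * 2 bits each = 8 bits
logprobs_b = [math.log(0.5)] * 8   # 8 tokens * 1 bit each = 8 bits

ppl_a = perplexity(logprobs_a)                      # about 4.0
ppl_b = perplexity(logprobs_b)                      # about 2.0
bpc_a = bits_per_character(logprobs_a, sentence_a)  # about 1.0
bpc_b = bits_per_character(logprobs_b, sentence_b)  # about 0.5
```

In this toy setup both sentences carry the same total surprisal, yet perplexity and bits-per-character each differ by a factor of two, purely because of how the text is segmented and how long its written form is. This is one way to see why the choice of normalization unit matters when comparing across languages.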

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods