Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?

Crystina Zhang; Jing Lu; Vinh Q. Tran; Tal Schuster; Donald Metzler; Jimmy Lin

arXiv:2411.04530·cs.CL·November 20, 2025

Open Access

TL;DR

This study investigates the degree to which multilingual language models (mLMs) understand text based on subword-level semantic concepts, and finds that shared subword semantics alone carry the models a long way in making predictions across different tokenizers and model sizes.

Contribution

The paper introduces "semantic tokens", formed by merging semantically similar subwords and their embeddings, and evaluates the updated multilingual models on five heterogeneous multilingual downstream tasks.

Findings

1. Shared subword semantics enhance model predictions across languages.
2. Semantic tokens improve zero-shot transfer performance.
3. Subword groups exhibit diverse semantic similarities, including synonyms and translations.

Abstract

Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form "semantic tokens" by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens…
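
The merging step described in the abstract can be illustrated with a small sketch. The abstract does not say how semantically similar subwords are identified, so everything below is an assumption made for illustration: subwords are grouped when the cosine similarity of their embeddings reaches a threshold, groups are closed transitively, and each group's embedding is the mean of its members. The function name `build_semantic_tokens`, the threshold value, and the toy vocabulary are hypothetical and are not taken from the paper.

```python
import numpy as np


def build_semantic_tokens(embeddings, threshold=0.9):
    """Group similar subwords and merge their embeddings (illustrative only).

    Assumptions, not the paper's method: similarity is cosine similarity,
    grouping is transitive (connected components), merging is a plain mean.
    Returns (token_to_group, merged_embeddings).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T  # pairwise cosine similarity

    n = embeddings.shape[0]
    parent = list(range(n))  # union-find forest over subword ids

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller id as root so group order is stable.
                    parent[max(ri, rj)] = min(ri, rj)

    roots = [find(i) for i in range(n)]
    group_of_root = {r: g for g, r in enumerate(sorted(set(roots)))}
    token_to_group = np.array([group_of_root[r] for r in roots])

    # One embedding per semantic token: the mean of its member embeddings.
    merged = np.stack([
        embeddings[token_to_group == g].mean(axis=0)
        for g in range(len(group_of_root))
    ])
    return token_to_group, merged


# Toy vocabulary with made-up 3-dimensional embeddings.
vocab = ["tomato", "tomate", "car", "auto"]
toy_embeddings = np.array([
    [1.00, 0.00, 0.0],
    [0.98, 0.20, 0.0],
    [0.00, 1.00, 0.0],
    [0.00, 0.95, 0.3],
])
token_to_group, merged = build_semantic_tokens(toy_embeddings, threshold=0.9)
```

In this toy setup, "tomato" and "tomate" share one semantic token and "car" and "auto" share another, so the embedding table shrinks from four rows to two. A model updated this way would map each original subword id through `token_to_group` before the embedding lookup.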

Taxonomy

Topics: Natural Language Processing Techniques