Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

Giuseppe Samo; Paola Merlo

arXiv:2602.05648·cs.CL·February 6, 2026

TL;DR

This study examines how transformer models handle complex verb forms in Turkish and Hebrew, highlighting the importance of tokenization strategies for capturing morphological structures in different languages.

Contribution

The paper shows that the choice of tokenization strategy shapes how well transformer models represent verbal morphology in Turkish and Hebrew.

Findings

01

Turkish models succeed with both atomic and subword tokenization.

02

Hebrew models diverge: the multilingual model with character-level tokenization fails, while the monolingual model with morpheme-aware segmentation succeeds.

03

All models perform better on more synthetic datasets.

Abstract

We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge: a multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, whereas a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets for all models.
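The contrast between concatenative and non-concatenative morphology drawn in the abstract can be illustrated with a toy sketch. This is not the authors' method or any model's tokenizer: the function names (`concatenate`, `interleave`, `is_contiguous`) are hypothetical, and the word forms are simplified Latin transliterations chosen only for illustration. It shows that a Turkish verb form is built by joining morphemes end to end, so each morpheme survives as a contiguous substring, whereas a Hebrew form interleaves a consonantal root with a vowel template, so the root never appears as a contiguous piece.

```python
def concatenate(morphemes):
    """Build a word by joining morphemes end to end (Turkish-style)."""
    return "".join(morphemes)


def interleave(root, template):
    """Fill each 'C' slot of a template with the next root consonant
    (a simplified picture of Hebrew root-and-pattern morphology)."""
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot for slot in template)


def is_contiguous(piece, word):
    """True if the piece occurs in the word as an unbroken substring."""
    return piece in word


# Turkish: gel (come) + iyor (progressive) + um (1st person singular)
turkish = concatenate(["gel", "iyor", "um"])

# Hebrew (transliterated): root k-t-v in the pattern CaCaC
hebrew = interleave("ktv", "CaCaC")

# Character-level tokens scatter the root across separate positions.
hebrew_chars = list(hebrew)

print(turkish, is_contiguous("gel", turkish))
print(hebrew, is_contiguous("ktv", hebrew), hebrew_chars)
```

Under these assumptions, a segmenter that splits on contiguous substrings can in principle recover the Turkish stem and suffixes, while recovering the Hebrew root requires tracking non-adjacent characters, which is one plausible reading of why segmentation choices matter more for Hebrew.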

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling