Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo, Paola Merlo

TL;DR
This study examines how transformer models handle complex verb forms in Turkish and Hebrew, highlighting the importance of tokenization strategies for capturing morphological structures in different languages.
Contribution
It demonstrates that tokenization methods significantly influence the ability of transformer models to represent verbal morphology in Turkish and Hebrew.
Findings
Turkish models succeed with both atomic and subword tokenization.
Hebrew models diverge: the multilingual model with character-level tokenization fails, while the monolingual model with morpheme-aware segmentation succeeds.
Performance improves on more synthetic datasets for all models.
Abstract
We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge: a multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, whereas a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets for all models.
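To make the contrast between tokenization strategies concrete, the following is a minimal sketch of atomic, character-level, and morpheme-aware splitting, plus a toy root-and-pattern interleaving of the kind found in non-concatenative morphology. The function names, the hand-written lexicon, and the transliterated example forms are illustrative assumptions; they are not the tokenizers, models, or data used in the study.

```python
# Illustrative only: segmentations are hand-written assumptions,
# not the output of the tokenizers evaluated in the paper.

def atomic(word):
    """Treat the whole word form as a single token."""
    return [word]

def characters(word):
    """Split the word into individual characters."""
    return list(word)

def morphemes(word, lexicon):
    """Look up a hand-annotated segmentation in a hypothetical lexicon."""
    return lexicon.get(word, [word])

def interleave(root, pattern):
    """Fill the C slots of a template with the root consonants, in order."""
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot for slot in pattern)

# Turkish 'geliyorum' ("I am coming"): stem gel- + progressive -iyor + 1sg -um.
LEXICON = {"geliyorum": ["gel", "iyor", "um"]}

word = "geliyorum"
print(atomic(word))              # one token for the whole form
print(characters(word))          # one token per character
print(morphemes(word, LEXICON))  # one token per morpheme

# Transliterated Hebrew-style example: root k-t-v in the template CaCaC.
print(interleave("ktv", "CaCaC"))
```

In the Turkish form the morphemes are contiguous substrings, so any left-to-right split can in principle align with them. In the root-and-pattern form the root consonants are interleaved with the template vowels, so no contiguous split isolates the root, which is one plausible reason character-level tokens may be less informative there.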
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling
