Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Volodymyr Ovcharov

TL;DR
This study benchmarks foundation models on Ukrainian legal text, revealing tokenizer fertility's impact on cost and performance, and highlights the importance of domain-specific analysis for zero-shot NLP tasks.
Contribution
It provides a comparative analysis of seven models on Ukrainian legal data, introduces a new dataset, and offers insights into tokenizer efficiency and zero-shot transfer in a low-resource language.
Findings
Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making them less token-efficient.
NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 at one-third the API cost.
Few-shot prompting degrades performance on Ukrainian legal tasks by up to 26 percentage points.
Abstract
Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (which has 5.6x more total parameters) at one-third the API cost; model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to…
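To make the fertility and cost relationship concrete, the following is a minimal sketch and not the paper's measurement code. It assumes the common definition of fertility as tokens per whitespace-delimited word; the excerpt above does not state the exact definition the authors use. The function names `fertility` and `relative_cost`, the two toy tokenizers, and the two-sentence corpus are all hypothetical stand-ins. In practice the `tokenize` callable would wrap each model's real tokenizer, and the relative cost holds only if both models charge the same price per input token.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_cost(fertility_a: float, fertility_b: float) -> float:
    """Input-token cost of tokenizer A relative to B on the same text,
    assuming an identical per-token price for both models."""
    return fertility_a / fertility_b


# Hypothetical toy tokenizers: they split each word into fixed-size
# character chunks and only stand in for real subword tokenizers.
def chunk2_tokenize(text: str) -> List[str]:
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def chunk3_tokenize(text: str) -> List[str]:
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


# Illustrative sentences only; not taken from the EDRSR dataset.
corpus = ["Суд ухвалив рішення", "Позов задоволено частково"]

f_fine = fertility(corpus, chunk2_tokenize)
f_coarse = fertility(corpus, chunk3_tokenize)
print(f"fertility (2-char chunks): {f_fine:.3f}")
print(f"fertility (3-char chunks): {f_coarse:.3f}")
print(f"relative input cost: {relative_cost(f_fine, f_coarse):.3f}x")
```

Under the same equal-price assumption, a tokenizer that emits 60% more tokens on identical input has a relative cost of 1.6, which matches the 1.6x spread reported in the abstract.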
