Evaluating Pixel Language Models on Non-Standardized Languages

Alberto Muñoz-Ortiz; Verena Blaschke; Barbara Plank

arXiv:2412.09084·cs.CL·December 13, 2024

TL;DR

Pixel-based language models render text as images, which helps with the out-of-vocabulary words common in dialectal data. In zero-shot dialect evaluation they outperform token-based models on several NLP tasks, though not on topic classification.

Contribution

This paper evaluates pixel-based language models for transfer learning from standard languages to dialects, using German as a case study, and shows where they have an advantage over token-based models.

Findings

01

Pixel models outperform token models in POS tagging, dependency parsing, and intent detection in zero-shot dialect evaluation, though not in Standard German.

02

Pixel models excel in zero-shot dialect evaluation, with gains of up to 26 percentage points in some scenarios.

03

Pixel models underperform in topic classification.

Abstract

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be…
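The patching step the abstract describes can be illustrated with a minimal sketch. It is not the authors' pipeline: `toy_render` is a hypothetical stand-in for a real font renderer, and the image height, character width, patch width, and example strings are all assumptions chosen to keep the example small. It only shows why nearby spellings can share most of their input patches without any fixed vocabulary.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values


def toy_render(text: str, height: int = 4, char_width: int = 4) -> Image:
    """Hypothetical stand-in for a font renderer (an assumption, not a real API).

    Each character becomes a height x char_width block whose pixel value
    is derived from its code point.
    """
    row: List[int] = []
    for ch in text:
        row.extend([ord(ch) % 256] * char_width)
    return [list(row) for _ in range(height)]


def split_into_patches(image: Image, patch_width: int, pad_value: int = 255) -> List[Image]:
    """Cut the image into consecutive patches that are patch_width columns wide.

    The right edge is padded with pad_value so the width becomes a
    multiple of patch_width.
    """
    if patch_width <= 0:
        raise ValueError("patch_width must be positive")
    width = len(image[0]) if image else 0
    padded_width = -(-width // patch_width) * patch_width  # ceiling to a multiple
    padded = [row + [pad_value] * (padded_width - width) for row in image]
    return [
        [row[start:start + patch_width] for row in padded]
        for start in range(0, padded_width, patch_width)
    ]


# Two illustrative spellings (assumed inputs, not taken from the paper's data).
standard = split_into_patches(toy_render("nicht"), patch_width=8)
variant = split_into_patches(toy_render("nich"), patch_width=8)

# Count leading positions where both spellings produce identical patches.
shared = sum(a == b for a, b in zip(standard, variant))
print(len(standard), len(variant), shared)
```

In this toy setup the two spellings differ only in their final patch, whereas a subword tokenizer may split an unseen spelling into unrelated pieces. A real pixel model would render with an actual font and feed the patches to an image encoder.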

Code & Models

Datasets

callum-canavan/challenges-for-unsupervised-elicitation
dataset· 27 dl

Taxonomy

Topics: Multimodal Machine Learning Applications