Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Kushal Tatariya; Vladimir Araujo; Thomas Bauwens; Miryam de Lhoneux

arXiv:2410.12011 · cs.CL · October 17, 2024

TL;DR

This paper probes the linguistic and visual capabilities of pixel-based language models. It finds a gap between their visual and linguistic understanding and shows how training strategies influence what they learn.

Contribution

The paper provides a layer-by-layer analysis of PIXEL's representations, distinguishing visual from linguistic features and examining the impact of rendering strategies.

Findings

01. Lower layers capture superficial visual features
02. Higher layers learn syntactic and semantic abstractions
03. Orthographic constraints can improve early surface-level learning
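
The layer-wise findings above are the kind of result a probing study produces: a small classifier is trained on frozen representations from each layer, and its held-out accuracy indicates how easily a property can be read from that layer. The sketch below illustrates the idea only. It does not load PIXEL or use the authors' code; `make_layer_features`, `train_linear_probe`, and `probe_layers` are hypothetical names, the features are synthetic, and the `signal` values that stand in for how strongly each layer encodes the property are assumptions.

```python
import random


def make_layer_features(labels, signal, dim=8, seed=0):
    """Hypothetical stand-in for one layer's hidden states.

    Each vector is Gaussian noise; the first dimension is shifted by
    +signal for label 1 and -signal for label 0 (an assumption made
    purely for illustration).
    """
    rng = random.Random(seed)
    feats = []
    for y in labels:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if y == 1 else -signal
        feats.append(vec)
    return feats


def predict(w, b, x):
    """Binary prediction of a linear classifier."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_linear_probe(feats, labels, epochs=20, lr=0.1):
    """Train a perceptron-style linear probe on frozen features."""
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            err = y - predict(w, b, x)
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def accuracy(w, b, feats, labels):
    """Fraction of examples the probe classifies correctly."""
    hits = sum(predict(w, b, x) == y for x, y in zip(feats, labels))
    return hits / len(labels)


def probe_layers(signals, n_train=200, n_test=200, seed=0):
    """Return one held-out probe accuracy per simulated layer."""
    rng = random.Random(seed)
    train_y = [rng.randint(0, 1) for _ in range(n_train)]
    test_y = [rng.randint(0, 1) for _ in range(n_test)]
    results = []
    for layer, signal in enumerate(signals):
        train_x = make_layer_features(train_y, signal, seed=seed + 2 * layer + 1)
        test_x = make_layer_features(test_y, signal, seed=seed + 2 * layer + 2)
        w, b = train_linear_probe(train_x, train_y)
        results.append(accuracy(w, b, test_x, test_y))
    return results


if __name__ == "__main__":
    # Simulated layers with increasing encoding strength (assumed values).
    for layer, acc in enumerate(probe_layers([0.0, 0.5, 3.0])):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

Each value returned by `probe_layers` is the held-out accuracy for one simulated layer; reading such values against layer index is how a probing curve is usually interpreted. A real study would replace the synthetic features with hidden states extracted from the model at each layer.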

Abstract

Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the…

Code & Models

Repositories

kushaltatariya/Pixology (PyTorch, official)

Videos

Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Semantic Web and Ontologies

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · WordPiece · Dropout · Layer Normalization · Adam · Attention Dropout