Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam de Lhoneux

TL;DR
This paper investigates the linguistic and visual capabilities of pixel-based language models. It reveals a gap between their visual and linguistic understanding and shows how training strategies influence what they learn.
Contribution
The paper provides a comprehensive analysis of PIXEL's layer-wise representations, highlighting the distinction between visual and linguistic features and the impact of rendering strategies.
Findings
Lower layers capture superficial visual features
Higher layers learn syntactic and semantic abstractions
Orthographic constraints at the input level can help the model learn surface-level features earlier
Abstract
Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the…
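The layer-wise probing idea mentioned in the abstract can be illustrated with a minimal sketch: train one small linear classifier (a "probe") on the representations of each layer and compare held-out accuracies across layers. The code below does not use PIXEL or the paper's tasks. It is a standard-library-only illustration on synthetic features, and every name in it (`make_layer_data`, `train_probe`, `probe_all_layers`, `layer_signals`) is hypothetical. The per-layer signal strengths are invented so that deeper "layers" are easier to probe by construction; they are not results from the paper.

```python
import math
import random


def make_layer_data(signal, n, dim, rng):
    """Synthetic stand-in for one layer's pooled hidden states (an assumption).

    The binary label shifts only the first feature by +/- signal;
    every feature also carries unit Gaussian noise.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if label == 1 else -signal
        data.append((x, label))
    return data


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(train, dim, lr=0.1, epochs=100):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = [0.0] * dim
    b = 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in train:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(probe, data):
    """Fraction of examples whose predicted label matches the true label."""
    w, b = probe
    correct = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(data)


def probe_all_layers(signals, dim=4, n_train=200, n_test=200, seed=0):
    """Train and evaluate one probe per synthetic layer."""
    rng = random.Random(seed)
    accs = []
    for signal in signals:
        train = make_layer_data(signal, n_train, dim, rng)
        test = make_layer_data(signal, n_test, dim, rng)
        accs.append(accuracy(train_probe(train, dim), test))
    return accs


# Invented signal strengths: layer 0 carries no label information,
# the last layer carries a lot. This is true only by construction.
layer_signals = [0.0, 1.0, 3.0]
layer_accs = probe_all_layers(layer_signals)
for layer, acc in enumerate(layer_accs):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

To probe a real model instead, `make_layer_data` would be replaced by features extracted from each layer's hidden states on a labelled probing task, while the per-layer training and comparison loop stays the same. Keeping the probe linear is a common design choice: it limits how much the probe itself can learn, so accuracy reflects what is already encoded in the representations.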
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Semantic Web and Ontologies
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · WordPiece · Dropout · Layer Normalization · Adam · Attention Dropout
