Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

TL;DR
Pix2Struct is a versatile pretrained model that parses masked web page screenshots into HTML, enabling it to excel across diverse visually-situated language tasks by unifying multiple signals like OCR and captioning.
Contribution
The paper introduces Pix2Struct, a pretraining approach in which the model learns to parse masked web page screenshots into simplified HTML. Combined with flexible input integration, it achieves state-of-the-art results across multiple domains.
Findings
Achieves state-of-the-art on six out of nine tasks.
Effective across documents, illustrations, UIs, and natural images.
Demonstrates the power of visual language pretraining with HTML parsing.
Abstract
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image…
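To make the idea of a "simplified HTML" target concrete, here is a minimal sketch using only the Python standard library. It is not the paper's preprocessing: the choice of which tags count as invisible, the decision to drop all attributes except an image's `alt` text, and the names `Simplifier` and `simplify_html` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is never rendered as visible text.
INVISIBLE = {"script", "style", "head", "noscript"}
# Void elements have no closing tag, so they must not change nesting depth.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Simplifier(HTMLParser):
    """Collects a stripped-down serialization: tag names and visible text only."""

    def __init__(self):
        super().__init__()
        self.out = []          # output tokens, joined at the end
        self.hidden_depth = 0  # > 0 while inside an invisible subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            # Keep only images, represented by their alt text (an assumption).
            if tag == "img" and self.hidden_depth == 0:
                alt = dict(attrs).get("alt") or ""
                self.out.append(f'<img alt="{alt}">')
            return
        if tag in INVISIBLE or self.hidden_depth:
            self.hidden_depth += 1
            return
        self.out.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self.hidden_depth == 0:
            self.out.append(text)


def simplify_html(html: str) -> str:
    """Return a simplified serialization of `html` (illustrative only)."""
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = (
        '<html><head><title>T</title></head><body>'
        '<div class="card"><h1>Pro</h1><p>$15 <b>per month</b></p>'
        '<img src="a.png" alt="logo"></div></body></html>'
    )
    print(simplify_html(page))
```

In a screenshot-parsing setup, a string like this would be the decoder's target while the rendered (and partially masked) screenshot is the input. The masking step itself is not sketched here.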
Code & Models
- 🤗 google/pix2struct-textcaps-base · model · 8.6k dl · ♡ 29
- 🤗 google/pix2struct-widget-captioning-large · model · 33 dl · ♡ 20
- 🤗 google/pix2struct-textcaps-large · model · 25 dl · ♡ 14
- 🤗 google/pix2struct-base · model · 2.4k dl · ♡ 79
- 🤗 google/pix2struct-ai2d-base · model · 1.6k dl · ♡ 43
- 🤗 google/pix2struct-chartqa-base · model · 1.1k dl · ♡ 10
- 🤗 google/pix2struct-docvqa-large · model · 157 dl · ♡ 32
- 🤗 google/pix2struct-docvqa-base · model · 2.8k dl · ♡ 44
- 🤗 google/pix2struct-widget-captioning-base · model · 19 dl · ♡ 6
- 🤗 google/pix2struct-ocrvqa-large · model · 23 dl · ♡ 34
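
The checkpoints listed above can be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, assuming `transformers`, `torch`, and Pillow are installed and that the weights can be downloaded. The helper name `caption_image` and the default `max_new_tokens` value are illustrative choices, not part of any official recipe.

```python
# Checkpoint ID taken from the list above.
CHECKPOINT = "google/pix2struct-textcaps-base"


def caption_image(image, checkpoint: str = CHECKPOINT, max_new_tokens: int = 50) -> str:
    """Generate text for a PIL image with a Pix2Struct checkpoint.

    Hypothetical helper: downloads the weights on first use.
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(checkpoint)
    model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

    # The processor turns the image into flattened patches for the encoder.
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A call might look like `caption_image(Image.open("screenshot.png"))` after `from PIL import Image`, where the file name is a placeholder. The question-answering checkpoints (for example the DocVQA ones) additionally expect the question to be passed to the processor via its `text` argument.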
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
