Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Kenton Lee; Mandar Joshi; Iulia Turc; Hexiang Hu; Fangyu Liu; Julian Eisenschlos; Urvashi Khandelwal; Peter Shaw; Ming-Wei Chang; Kristina Toutanova

arXiv:2210.03347 · cs.CL · June 19, 2023 · 46 cites

TL;DR

Pix2Struct is a pretrained image-to-text model that learns to parse masked web page screenshots into simplified HTML. This single objective unifies signals such as OCR and captioning, and the model can be finetuned across diverse visually-situated language tasks.

Contribution

The paper introduces Pix2Struct, a novel pretraining approach using screenshot parsing into HTML, with flexible input integration, achieving state-of-the-art results across multiple domains.

Findings

01. Achieves state-of-the-art on six out of nine tasks.

02. Effective across documents, illustrations, UIs, and natural images.

03. Demonstrates the power of visual language pretraining with HTML parsing.

Abstract

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image…
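
The pretraining target described in the abstract is simplified HTML. The excerpt here does not say how the paper simplifies the markup, so the following is only a minimal sketch under stated assumptions: it drops attributes, void elements, and non-rendered elements (`script`, `style`, `head`), and keeps the remaining tags and visible text. The class `HtmlSimplifier`, the function `simplify_html`, and both tag sets are hypothetical names chosen for illustration; the sketch also assumes well-formed input and does not cover the screenshot rendering or masking side of the objective.

```python
import re
from html.parser import HTMLParser

# Assumption: elements whose content is not rendered as visible page text.
SKIPPED_TAGS = {"script", "style", "head"}
# Void elements have no closing tag and carry no text, so they are dropped.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlSimplifier(HTMLParser):
    """Re-emits tags without attributes and keeps only visible text."""

    def __init__(self):
        super().__init__()
        self.parts = []       # output fragments, joined at the end
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIPPED_TAGS or self.skip_depth:
            self.skip_depth += 1
            return
        self.parts.append(f"<{tag}>")  # attributes are intentionally dropped

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = re.sub(r"\s+", " ", data)  # collapse whitespace runs
        if text.strip():
            self.parts.append(text)


def simplify_html(html):
    """Return a simplified copy of `html` (illustrative rules only)."""
    parser = HtmlSimplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


page = (
    '<html><head><title>T</title><style>p{}</style></head>'
    '<body><div class="x"><p>Hello <b>world</b></p>'
    '<script>var a=1;</script><img src="a.png"></div></body></html>'
)
print(simplify_html(page))
```

In a pretraining pipeline of the kind the abstract describes, a string like this would serve as the text target paired with a screenshot of the same page; how the pairing and masking are actually done is not specified in this excerpt.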

Code & Models

Datasets

ivelin/ui_refexp
dataset · 61 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques