Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Zhicheng Huang; Zhaoyang Zeng; Bei Liu; Dongmei Fu; Jianlong Fu

arXiv:2004.00849 · cs.CV · June 23, 2020 · 286 cites

TL;DR

Pixel-BERT introduces a unified deep multi-modal transformer that aligns image pixels with text directly, improving performance on vision-language tasks without relying on region-based features or bounding box annotations.

Contribution

It presents a novel end-to-end model that aligns pixels with text, overcoming limitations of region-based features and reducing annotation costs.

Findings

01

Achieves state-of-the-art results on VQA, image-text retrieval, and NLVR tasks.

02

Boosts single-model VQA performance by 2.17 points over the previous state of the art under fair comparison.

03

Uses a random pixel sampling mechanism to make the visual representation more robust, with Masked Language Modeling and Image-Text Matching as pre-training tasks.
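
The listing does not spell out how the pixel sampling mechanism works. As a minimal sketch, assuming it means keeping a random subset of per-pixel feature vectors from a CNN feature map before they are passed to the transformer, the idea might look like the following. The function name `sample_pixel_features`, its parameters, and the nested-list layout of the feature map are hypothetical and chosen only for illustration; they are not taken from the paper or its repository.

```python
import random


def sample_pixel_features(feature_map, num_samples, seed=None):
    """Randomly keep `num_samples` per-pixel feature vectors.

    feature_map: nested list of shape H x W x C (hypothetical layout),
                 i.e. a list of rows, each row a list of feature vectors.
    Returns a flat list of feature vectors in their original row-major order.
    """
    rng = random.Random(seed)
    # Flatten the H x W grid into one sequence of per-pixel feature vectors.
    flat = [vec for row in feature_map for vec in row]
    # If fewer pixels exist than requested, keep all of them.
    if num_samples >= len(flat):
        return flat
    # Sample positions without replacement, then restore spatial order.
    kept = sorted(rng.sample(range(len(flat)), num_samples))
    return [flat[i] for i in kept]


# Toy 4 x 5 feature map whose "features" are just the (row, col) coordinates.
toy_map = [[[r, c] for c in range(5)] for r in range(4)]
subset = sample_pixel_features(toy_map, num_samples=6, seed=0)
print(len(subset), "of", 4 * 5, "pixel features kept")
```

Sampling without replacement shortens the visual sequence the transformer has to attend over, and a different subset on each call would act as a form of input perturbation. Whether sampling is applied only during pre-training, and how many pixels are kept, are details this listing does not state.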

Abstract

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embeddings in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as most recent vision and language methods do. Pixel-BERT, which aligns semantics at the pixel and text level, removes the limitation of task-specific visual representations for vision and language tasks. It also relieves the cost of bounding box annotations and overcomes the imbalance between the semantic labels of visual tasks and language semantics. To provide a better representation for downstream tasks, we pre-train a universal end-to-end model with image and sentence pairs from the Visual Genome and MS-COCO datasets. We propose…

Code & Models

Repositories

microsoft/xpretrain
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Pixel-BERT