Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

Josh Beal; Hao-Yu Wu; Dong Huk Park; Andrew Zhai; Dmitry Kislyuk

arXiv:2108.05887 · cs.CV · August 13, 2021

TL;DR

This paper demonstrates that billion-scale pretraining of vision transformers significantly enhances multi-task visual representations, leading to substantial improvements in a real-world visual shopping system.

Contribution

The paper introduces a dataset of over a billion images built with a systematic weakly-supervised annotation method, and replaces the convolutional backbone with a Transformer for large-scale visual pretraining in an industry application.

Findings

1. 36% improvement in top-1 relevance
2. 23% increase in click-through volume
3. Transformers outperform CNNs at billion-scale pretraining

Abstract

Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively unexplored. We consider the case of a popular visual discovery product, where these representations are trained with multi-task learning, from use-case-specific visual understanding (e.g. skin tone classification) to general representation learning for all visual content (e.g. embeddings for retrieval). In this work, we describe how we (1) generate a dataset with over a billion images via large weakly-supervised pretraining to improve the performance of these visual representations, and (2) leverage Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially at 1B+ image…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques