Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

TL;DR
This paper demonstrates that billion-scale pretraining of vision transformers significantly enhances multi-task visual representations, leading to substantial improvements in a real-world visual shopping system.
Contribution
The paper introduces a dataset of over a billion images, built with a systematic weakly-supervised annotation method, and replaces the CNN backbone with a Transformer for large-scale visual pretraining in an industry application.
Findings
36% improvement in top-1 relevance
23% increase in click-through volume
Transformers outperform CNNs at billion-scale pretraining
Abstract
Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively unexplored. We consider the case of a popular visual discovery product, where these representations are trained with multi-task learning, from use-case-specific visual understanding (e.g. skin tone classification) to general representation learning for all visual content (e.g. embeddings for retrieval). In this work, we describe how we (1) generate a dataset with over a billion images via large weakly-supervised pretraining to improve the performance of these visual representations, and (2) leverage Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially at 1B+ image…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
