Florence: A New Foundation Model for Computer Vision

Lu Yuan; Dongdong Chen; Yi-Ling Chen; Noel Codella; Xiyang Dai; Jianfeng Gao; Houdong Hu; Xuedong Huang; Boxin Li; Chunyuan Li; Ce Liu; Mengchen Liu; Zicheng Liu; Yumao Lu; Yu Shi; Lijuan Wang; Jianfeng Wang; Bin Xiao; Zhen Xiao; Jianwei Yang; Michael Zeng; Luowei Zhou; Pengchuan Zhang

arXiv:2111.11432 · cs.CV · November 23, 2021 · 340 cites

TL;DR

Florence is a versatile computer vision foundation model trained on large-scale web data. It generalizes across diverse tasks and modalities and achieves state-of-the-art results on numerous benchmarks.

Contribution

Introducing Florence, a new vision foundation model that expands representations across multiple levels and modalities, and demonstrates superior transfer learning and benchmark performance.

Findings

01

Achieves 83.74% top-1 accuracy on ImageNet-1K zero-shot classification.

02

Attains 62.4 mAP on COCO object detection.

03

Reaches 87.8% accuracy on Kinetics-600 action recognition.

Abstract

Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data,…
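
The cross-modal shared representation the abstract refers to is what makes zero-shot classification possible: an image embedding is compared against one text embedding per class name, and the closest class wins. The following is a minimal sketch of that idea, not Florence's implementation. The embeddings are made-up toy vectors standing in for real encoder outputs, the function names `normalize` and `zero_shot_classify` are hypothetical, and the temperature of 0.07 is an assumed value chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Pick the class whose text embedding is closest to the image embedding.

    image_emb: list of floats (toy stand-in for an image encoder output).
    text_embs: dict mapping class name -> list of floats (toy stand-in for
               a text encoder output, e.g. for a prompt naming the class).
    temperature: assumed scaling factor applied to the similarities.
    Returns (best_label, probabilities_by_label).
    """
    img = normalize(image_emb)
    logits = {}
    for label, emb in text_embs.items():
        txt = normalize(emb)
        logits[label] = sum(a * b for a, b in zip(img, txt)) / temperature

    # Softmax over classes; subtract the max logit for numerical stability.
    max_logit = max(logits.values())
    exps = {k: math.exp(v - max_logit) for k, v in logits.items()}
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}

    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings, invented for illustration only.
text_embs = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs)
print(label, probs)
```

No class-specific training happens here: adding a new class only requires adding one more text embedding, which is why models with a shared image-text space can be evaluated on benchmarks such as ImageNet-1K without fine-tuning.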

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Florence · ALIGN · Contrastive Language-Image Pre-training