Florence: A New Foundation Model for Computer Vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou

TL;DR
Florence is a versatile computer vision foundation model trained on large-scale web data. It generalizes across diverse tasks and modalities and achieves state-of-the-art results on numerous benchmarks.
Contribution
Introduces Florence, a new vision foundation model that expands representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth), and that demonstrates strong transfer learning and benchmark performance.
Findings
Achieves 83.74% top-1 accuracy on ImageNet-1K zero-shot classification.
Attains 62.4 mAP on COCO object detection.
Reaches 87.8% accuracy on Kinetics-600 action recognition.
Abstract
Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping image and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Florence · ALIGN · Contrastive Language-Image Pre-training
