Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao

TL;DR
Florence-VL introduces a novel multimodal model that integrates a generative vision encoder with depth-breadth fusion, significantly improving vision-language understanding and performance across diverse benchmarks.
Contribution
The paper presents Florence-VL, a new family of multimodal large language models that combines Florence-2's versatile visual features with a novel depth-breadth fusion architecture to improve performance on vision-language tasks.
Findings
Outperforms existing models on multiple vision-language benchmarks.
Enriched visual features improve vision-language alignment.
Depth-breadth fusion enhances multi-level visual feature integration.
Abstract
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile to be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrate Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully…
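The fusion idea in the abstract (combining features from different depths and from multiple prompts before handing them to the LLM) can be illustrated with a small sketch. The abstract as shown here does not state the fusion operator, so the code below assumes channel-wise concatenation followed by a single linear projection. The function names, toy dimensions, and random stand-in features are all hypothetical and are not taken from the Florence-VL code.

```python
import random


def concat_channels(feature_maps):
    """Fuse several feature maps by concatenating them along the channel axis.

    feature_maps: list of maps; each map is a list of tokens, and each token
    is a list of floats. All maps must contain the same number of tokens.
    """
    num_tokens = len(feature_maps[0])
    if any(len(fm) != num_tokens for fm in feature_maps):
        raise ValueError("all feature maps must have the same number of tokens")
    return [
        [value for fm in feature_maps for value in fm[t]]
        for t in range(num_tokens)
    ]


def project(tokens, weight):
    """Apply a linear projection: each token (d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]


rng = random.Random(0)


def toy_features(num_tokens, dim):
    """Random stand-in for the output of a vision encoder (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]


NUM_TOKENS, DIM, LLM_DIM = 4, 3, 5  # toy sizes, chosen only for illustration

# "Depth": features taken from a lower encoder layer (hypothetical stand-in).
depth_low = toy_features(NUM_TOKENS, DIM)
# "Breadth": features produced under two different task prompts (hypothetical).
prompt_a = toy_features(NUM_TOKENS, DIM)
prompt_b = toy_features(NUM_TOKENS, DIM)

# Assumed fusion: concatenate channels, then project to the LLM embedding size.
fused = concat_channels([depth_low, prompt_a, prompt_b])
weight = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(3 * DIM)]
visual_tokens = project(fused, weight)

print(len(fused), len(fused[0]))
print(len(visual_tokens), len(visual_tokens[0]))
```

Concatenating along channels keeps the number of visual tokens fixed while widening each token, which is one plausible reason to prefer it over appending the feature maps as extra tokens; whether the paper makes the same choice is not stated in the text above.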
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Linear Layer · Softmax · Multi-Head Attention · LLaMA · Dense Connections · Layer Normalization · Residual Connection · Attention Is All You Need · Vision Transformer
