Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Jiuhai Chen; Jianwei Yang; Haiping Wu; Dianqi Li; Jianfeng Gao; Tianyi Zhou; Bin Xiao

arXiv:2412.04424·cs.CV·December 6, 2024

TL;DR

Florence-VL introduces a novel multimodal model that integrates a generative vision encoder with depth-breadth fusion, significantly improving vision-language understanding and performance across diverse benchmarks.

Contribution

The paper presents Florence-VL, a new multimodal large language model that leverages Florence-2's versatile visual features and a novel depth-breadth fusion architecture, enhancing vision-language tasks.

Findings

1. Outperforms existing models on multiple vision-language benchmarks.
2. Enriched visual features improve vision-language alignment.
3. Depth-breadth fusion enhances multi-level visual feature integration.

Abstract

We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which makes its representations more versatile and easier to adapt to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrate Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully…
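The abstract describes DBFusion only at a high level: features taken from different encoder depths and under multiple prompts are fused, then passed through a projection layer into the LLM. The sketch below illustrates one way such a fusion could look. It is not the authors' implementation. The fusion operator (channel-wise concatenation), the feature shapes, and every name (`fuse_channels`, `project`, `make_features`) are assumptions made for illustration, and random numbers stand in for real encoder outputs.

```python
import random


def fuse_channels(feature_maps):
    """Concatenate several [tokens][channels] maps along the channel axis.

    Channel concatenation is an assumed fusion operator; the abstract
    does not say which operator DBFusion uses.
    """
    num_tokens = len(feature_maps[0])
    if any(len(fm) != num_tokens for fm in feature_maps):
        raise ValueError("all feature maps must have the same number of tokens")
    return [
        [value for fm in feature_maps for value in fm[t]]
        for t in range(num_tokens)
    ]


def project(tokens, weight):
    """Apply a linear map to each token; weight is [out_dim][in_dim]."""
    return [
        [sum(w * x for w, x in zip(row, token)) for row in weight]
        for token in tokens
    ]


def make_features(num_tokens, dim, rng):
    """Hypothetical stand-in for one visual feature map from the encoder."""
    return [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]


rng = random.Random(0)
num_tokens, dim, llm_dim = 4, 3, 5  # toy sizes, chosen arbitrarily

# "Depth": feature maps from different encoder layers (two here, assumed).
depth_features = [make_features(num_tokens, dim, rng) for _ in range(2)]
# "Breadth": feature maps produced under different prompts (three here, assumed).
breadth_features = [make_features(num_tokens, dim, rng) for _ in range(3)]

# Fuse all maps per token, then project to the LLM embedding width.
fused = fuse_channels(depth_features + breadth_features)
weight = [[rng.gauss(0, 0.02) for _ in range(len(fused[0]))] for _ in range(llm_dim)]
visual_tokens = project(fused, weight)

print(len(fused), len(fused[0]))                  # 4 15
print(len(visual_tokens), len(visual_tokens[0]))  # 4 5
```

With channel concatenation the number of visual tokens stays fixed while the per-token width grows with the number of fused maps, so only the projection layer's input size changes. Concatenating along the token axis instead would lengthen the sequence the LLM has to process. Which trade-off the paper actually makes is not stated in the abstract.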

Code & Models

Repositories

jiuhaichen/florence-vl
PyTorch · Official

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Linear Layer · Softmax · Multi-Head Attention · LLaMA · Dense Connections · Layer Normalization · Residual Connection · Attention Is All You Need · Vision Transformer