Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput

Bo Zhang; Shuo Li; Runhe Tian; Yang Yang; Jixin Tang; Jinhao Zhou; Lin Ma

arXiv:2505.09498·cs.CV·May 15, 2025

Open Access 4 Models

TL;DR

Flash-VL 2B is a vision-language model optimized for ultra-low latency and high throughput in real-time applications. It reports state-of-the-art speed and accuracy across 11 standard VLM benchmarks.

Contribution

The paper combines tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching to optimize VLMs for real-time deployment.

Findings

01. Achieves state-of-the-art speed and accuracy on 11 benchmarks.

02. Maintains competitive performance with reduced processing time.

03. Effective in resource-constrained and large-scale real-time environments.

Abstract

In this paper, we introduce Flash-VL 2B, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising…
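The abstract names token compression as one ingredient but does not say which mechanism is used. As a rough illustration of the general idea only, and not the method used by Flash-VL 2B, the sketch below merges each 2x2 block of visual tokens into a single token by concatenating their feature vectors, which cuts the number of tokens passed to the language model by a factor of four. The function name `compress_tokens` and the toy grid are hypothetical.

```python
from typing import List

Token = List[float]


def compress_tokens(grid: List[List[Token]], factor: int = 2) -> List[List[Token]]:
    """Merge each factor x factor block of tokens into one longer token.

    Generic space-to-depth style merging, shown for illustration only;
    it is an assumption, not the mechanism described in the paper.
    """
    height = len(grid)
    width = len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid dimensions must be divisible by factor")

    compressed = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            merged: Token = []
            # Concatenate the features of every token in the block,
            # scanning the block row by row.
            for dr in range(factor):
                for dc in range(factor):
                    merged.extend(grid[r + dr][c + dc])
            row.append(merged)
        compressed.append(row)
    return compressed


# Toy input: a 4x4 grid of 3-dimensional visual tokens.
grid = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]
compressed = compress_tokens(grid)

num_in = sum(len(row) for row in grid)
num_out = sum(len(row) for row in compressed)
print(num_in, "tokens ->", num_out, "tokens of width", len(compressed[0][0]))
```

The trade-off in this kind of scheme is that sequence length shrinks while per-token width grows, so a projection layer would normally follow to map the merged features back to the language model's hidden size.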

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Satellite Image Processing and Photogrammetry · Advanced Vision and Imaging

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings