Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Quoc-Huy Trinh; Mustapha Abdullahi; Bo Zhao; Debesh Jha

arXiv:2604.04579·cs.CV·April 8, 2026

TL;DR

Firebolt-VL is an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder and adds a Token-Grid Correlation Module, aiming for accurate, fine-grained understanding at lower computational cost.

Contribution

The paper introduces Firebolt-VL, a novel model that uses a Liquid Foundation Model decoder and a Token-Grid Correlation Module for efficient, fine-grained vision-language understanding.

Findings

01. Achieves accurate fine-grained understanding across benchmarks.

02. Maintains linear-time inference for efficiency.

03. Outperforms existing models in resource-constrained scenarios.
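
To make the linear-time inference finding concrete, the sketch below compares how a rough operation count grows with sequence length for a standard attention layer (quadratic in the number of tokens) and for a generic layer that does a fixed amount of work per token. It is a back-of-the-envelope cost model, not the paper's LFM decoder: the function names, the `per_token_cost` constant, and the choice to count only the two main matrix products in attention are all assumptions made for illustration.

```python
def attention_ops(seq_len: int, dim: int) -> int:
    """Rough multiply-add count for one attention layer.

    Counts only the score matrix (Q @ K^T) and the weighted sum
    (weights @ V), each costing seq_len * seq_len * dim.
    """
    return 2 * seq_len * seq_len * dim


def linear_layer_ops(seq_len: int, dim: int, per_token_cost: int = 2) -> int:
    """Rough count for a hypothetical linear-time layer.

    Assumes a fixed amount of work per token and channel;
    per_token_cost is an illustrative constant, not a measured value.
    """
    return per_token_cost * seq_len * dim


def growth_when_doubling(cost_fn, seq_len: int, dim: int) -> float:
    """Factor by which the cost grows when the sequence length doubles."""
    return cost_fn(2 * seq_len, dim) / cost_fn(seq_len, dim)


if __name__ == "__main__":
    dim = 64
    for n in (256, 1024, 4096):
        ratio = attention_ops(n, dim) / linear_layer_ops(n, dim)
        print(f"tokens={n:5d}  attention/linear op ratio={ratio:.0f}")
    print("attention growth on doubling:", growth_when_doubling(attention_ops, 512, dim))
    print("linear growth on doubling:   ", growth_when_doubling(linear_layer_ops, 512, dim))
```

Under this model, doubling the number of tokens quadruples the attention cost but only doubles the linear-time cost, which is why the gap matters most for long visual token sequences on constrained hardware.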

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, which degrades their performance on fine-grained reasoning tasks and limits their real-world effectiveness. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which…

Code & Models

Repositories

https://fireboltvl.github.io
