jina-vlm: Small Multilingual Vision Language Model

Andreas Koukounas; Georgios Mastrapas; Florian Hönicke; Sedigheh Eslami; Guillaume Roncari; Scott Martens; Han Xiao

arXiv:2512.04032·cs.CL·May 5, 2026

TL;DR

jina-vlm is a 2.4B-parameter multilingual vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs, using image tiling and attention pooling to process arbitrary-resolution images with few tokens.

Contribution

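The paper introduces jina-vlm, a token-efficient multilingual VLM that pairs a SigLIP2 vision encoder and a Qwen3 language decoder with image tiling and attention pooling, and uses a leave-one-out data mixture ablation to diagnose which training data categories are necessary and which are redundant.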
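To make the token budget concrete, here is a minimal sketch of how image tiling combined with attention pooling can bound the number of visual tokens. It is not the released implementation: the tile size (384), patch size (16), 2x2 pooling window, 12-tile cap, and all function names are illustrative assumptions, and the pooling step uses a mean query without the learned projections a real model would have.

```python
import math

def tile_grid(width, height, tile=384, max_tiles=12):
    """Return (cols, rows) of fixed-size tiles covering the image.

    All defaults are hypothetical values chosen for illustration.
    """
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    # Shrink the grid along its longer axis until it fits the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

def attention_pool(window):
    """Pool a window of patch vectors into a single vector.

    Simplified: the query is the window mean, and each patch is weighted
    by softmax(q . k / sqrt(d)). No learned weights are involved.
    """
    d = len(window[0])
    query = [sum(v[i] for v in window) / len(window) for i in range(d)]
    scores = [sum(q * k for q, k in zip(query, v)) / math.sqrt(d) for v in window]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, window)) for i in range(d)]

def pooled_token_count(width, height, tile=384, patch=16, pool=2, max_tiles=12):
    """Count visual tokens after pooling each pool x pool patch window."""
    cols, rows = tile_grid(width, height, tile, max_tiles)
    per_side = tile // patch                   # patches along one tile side
    pooled_side = math.ceil(per_side / pool)   # tokens along one side after pooling
    return cols * rows * pooled_side * pooled_side

if __name__ == "__main__":
    print(tile_grid(1000, 500))
    print(pooled_token_count(1000, 500))
```

Under these assumed settings, a 1000x500 image maps to a 3x2 tile grid and 864 pooled tokens, compared with 3456 patch tokens if no pooling were applied.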

Findings

01 Achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs.

02 Demonstrates effective token-efficient processing of arbitrary-resolution images.

03 Provides a systematic analysis of data category contributions via ablation studies.

Abstract

We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
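The leave-one-out protocol can be sketched as a short loop: train on the full mixture, then retrain with each category removed and record the change in score. The category names, the scores, and the `toy_train_and_eval` stand-in below are hypothetical and exist only to make the example runnable. They are not the paper's data or results.

```python
def leave_one_out(categories, train_and_eval):
    """Return the full-mixture score and the score change per held-out category."""
    baseline = train_and_eval(categories)
    deltas = {}
    for held_out in categories:
        subset = [c for c in categories if c != held_out]
        # A strongly negative delta suggests the category is necessary;
        # a delta near zero suggests it is redundant given the rest.
        deltas[held_out] = train_and_eval(subset) - baseline
    return baseline, deltas

# Hypothetical stand-in for a real training and evaluation run:
# each category contributes a fixed number of benchmark points.
TOY_CONTRIBUTION = {"ocr": 3, "charts": 1, "multilingual": 2, "text_only": 0}

def toy_train_and_eval(categories):
    return sum(TOY_CONTRIBUTION[c] for c in categories)

if __name__ == "__main__":
    baseline, deltas = leave_one_out(list(TOY_CONTRIBUTION), toy_train_and_eval)
    print(baseline, deltas)
```

In a real study `train_and_eval` would be a full training run per mixture, and a per-benchmark score vector in place of the single number would show whether removing one category hurts tasks in other domains.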

Code & Models

Repositories

https://huggingface.co/jinaai/jina-vlm