TL;DR
jina-vlm is a 2.4B-parameter multilingual vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs, using image tiling and attention pooling to process arbitrary-resolution images with few tokens.
Contribution
The paper introduces jina-vlm, a token-efficient multilingual VLM that combines image tiling with attention pooling, and uses a leave-one-out data mixture ablation study to examine which training data categories are necessary and which are redundant.
Findings
Achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs.
Demonstrates effective token-efficient processing of arbitrary-resolution images.
Provides a systematic analysis of data category contributions via ablation studies.
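The leave-one-out ablation mentioned above can be sketched as follows. This is only an illustration of the general technique, not the authors' code: the category names and the helper functions `leave_one_out_mixtures` and `ablation_deltas` are hypothetical, and the scores are whatever the caller measured.

```python
from typing import Dict, List


def leave_one_out_mixtures(categories: List[str]) -> Dict[str, List[str]]:
    """Map each held-out category to the training mixture that omits it."""
    return {
        held_out: [c for c in categories if c != held_out]
        for held_out in categories
    }


def ablation_deltas(baseline: float, scores: Dict[str, float]) -> Dict[str, float]:
    """Score change versus the full-mixture baseline.

    A negative delta means removing that category hurt the benchmark,
    which suggests the category is necessary rather than redundant.
    """
    return {held_out: score - baseline for held_out, score in scores.items()}


# Hypothetical category names, for illustration only.
mixtures = leave_one_out_mixtures(["ocr", "charts", "multilingual"])
for held_out, kept in mixtures.items():
    print(f"without {held_out}: train on {kept}")
```

Each mixture would be used to train one model, and the resulting deltas compared across benchmarks to see whether a category's benefit transfers to other domains.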
Abstract
We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
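To make the two token-saving ideas concrete, here is a minimal NumPy sketch of tiling an arbitrary-resolution image into a grid and of attention pooling over small blocks of patch tokens. It is an assumption-laden illustration, not the model's implementation: the tile size, the 2x2 pooling window, the use of the block mean as the query, and the absence of learned projections are all choices made here for brevity, and `tile_grid` and `attention_pool` are hypothetical names.

```python
import math
from typing import Tuple

import numpy as np


def tile_grid(height: int, width: int, tile: int) -> Tuple[int, int]:
    """Rows and columns of fixed-size tiles needed to cover the image.

    Edge tiles are assumed to be padded up to the full tile size.
    """
    return math.ceil(height / tile), math.ceil(width / tile)


def attention_pool(tokens: np.ndarray, window: int = 2) -> np.ndarray:
    """Pool an (H, W, D) grid of patch tokens over window x window blocks.

    Assumption: the query is the mean of each block and there are no
    learned projections; a trained model would learn these.
    """
    h, w, d = tokens.shape
    assert h % window == 0 and w % window == 0, "grid must divide evenly"
    # Group tokens into non-overlapping blocks: (h', w', window*window, d).
    blocks = tokens.reshape(h // window, window, w // window, window, d)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(h // window, w // window, window * window, d)
    query = blocks.mean(axis=2, keepdims=True)
    # Scaled dot-product scores between the query and each token in the block.
    scores = (query * blocks).sum(axis=-1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum collapses each block to a single token.
    return (weights[..., None] * blocks).sum(axis=2)


rows, cols = tile_grid(height=1000, width=600, tile=400)
patch_tokens = np.random.default_rng(0).normal(size=(4, 6, 8))
pooled = attention_pool(patch_tokens, window=2)
print(rows, cols, patch_tokens.shape, pooled.shape)
```

With a 2x2 window, each tile's token count drops by a factor of four, which is the kind of reduction that keeps the sequence length manageable as the number of tiles grows.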
