ZAYA1-VL-8B Technical Report

Hassan Shapourian; Kasra Hejazi; Olabode M. Sule; Beren Millidge

arXiv:2605.08560·cs.CV·May 12, 2026

TL;DR

ZAYA1-VL-8B is a compact mixture-of-experts vision-language model (9.2B total parameters, 1.4B active) that is competitive with leading base models such as Molmo2-4B and InternVL3.5-4B. The model is publicly available.

Contribution

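The paper introduces two architectural changes to a compact model: vision-specific LoRA adapters integrated into the LLM, which add modality-specific capacity without adding experts, and bidirectional attention over image tokens within the LLM, which is intended to improve visual understanding.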
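One way to picture a vision-specific LoRA adapter is a linear layer whose low-rank update is applied only to image tokens. The following is a minimal sketch under that assumption; the abstract does not say where the adapters attach or how tokens are routed to them, so the `is_image` gating, the function names, and the `alpha / r` scaling (the usual LoRA convention) are illustrative choices, not the report's implementation.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def modality_gated_lora_linear(x, is_image, W, A, B, alpha):
    """Hypothetical linear layer with a LoRA update used only for image tokens.

    W: frozen base weight, shape (d_out, d_in)
    A: LoRA down-projection, shape (r, d_in)
    B: LoRA up-projection, shape (d_out, r)
    alpha: LoRA scaling numerator; the update is scaled by alpha / r
    """
    y = matvec(W, x)
    if is_image:  # assumption: text tokens bypass the adapter entirely
        r = len(A)
        delta = matvec(B, matvec(A, x))
        y = [yi + (alpha / r) * di for yi, di in zip(y, delta)]
    return y


# Toy example: identity base weight with a rank-1 adapter.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B = [[1], [0]]
x = [1, 2]

text_out = modality_gated_lora_linear(x, False, W, A, B, alpha=1)
image_out = modality_gated_lora_linear(x, True, W, A, B, alpha=1)
print(text_out, image_out)
```

In this sketch the base weight is shared by both modalities and only the small matrices `A` and `B` are vision-specific, which is how such an adapter could add capacity for image tokens without adding experts.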

Findings

01

Surpasses Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks.

02

Integrates vision-specific LoRA adapters into the LLM to increase modality-specific capacity without increasing the number of experts.

03

Is competitive with Molmo2-4B and InternVL3.5-4B while using 9.2B total parameters, of which 1.4B are active.

Abstract

We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters…
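The attention masking scheme mentioned in the abstract can be illustrated with a small sketch: text tokens keep the usual causal mask, while image tokens may also attend to later tokens of the same image. The abstract does not give the exact scheme, so restricting bidirectional attention to tokens of the same image, the `image_ids` encoding, and the function name are assumptions made for illustration.

```python
def build_attention_mask(image_ids):
    """Build a boolean attention mask for one sequence.

    image_ids[i] is None for a text token, or an integer identifying the
    image that token i belongs to. mask[q][k] is True when query position q
    may attend to key position k.
    """
    n = len(image_ids)
    # Causal baseline: every token attends to itself and earlier positions.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    # Assumption: tokens of the same image also see each other's future positions.
    for q in range(n):
        for k in range(q + 1, n):
            if image_ids[q] is not None and image_ids[q] == image_ids[k]:
                mask[q][k] = True
    return mask


# Sequence layout: one text token, three tokens of image 0, one text token.
mask = build_attention_mask([None, 0, 0, 0, None])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under sequence packing, the same idea would also need to block attention across packed samples; that is left out here to keep the sketch short.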

Code & Models

Repositories

https://huggingface.co/Zyphra/ZAYA1-VL

Models

Zyphra/ZAYA1-VL-8B
model · 578 downloads · 38 likes
