PLaMo 2.1-VL Technical Report

Tommi Kerola; Yuya Masuda; Takashi Masuko; Toshiki Nakanishi; Daisuke Nishino; Kuniyuki Takahashi; Hanqin Wang; Yoshihiro Yamada

arXiv:2604.19324 · cs.CV · April 22, 2026

TL;DR

PLaMo 2.1-VL is a lightweight vision language model, available in 8B and 2B variants, that operates in Japanese and targets local and edge deployment. It focuses on VQA and visual grounding, with applications in factory task analysis and infrastructure anomaly detection.

Contribution

The paper introduces PLaMo 2.1-VL, a compact VLM with Japanese-language support, together with a large-scale synthetic data generation pipeline and Japanese training and evaluation resources. The models are evaluated on two real-world industrial scenarios.

Findings

01 Outperforms comparable open models on Japanese and English benchmarks.

02 Achieves 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4.

03 Fine-tuning on power plant data improves the anomaly detection F1-score from 39.7 to 64.9.

Abstract

We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly…
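
For readers unfamiliar with the ROUGE-L metric cited above, the following is a minimal sketch of how an LCS-based ROUGE-L F-measure can be computed. It is not the authors' evaluation code: the function names `lcs_length` and `rouge_l`, the character-level tokenization of Japanese text, and the balanced `beta=1.0` weighting are all assumptions made for illustration, and the report's actual tokenizer and ROUGE variant may differ.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: Sequence[str], reference: Sequence[str],
            beta: float = 1.0) -> float:
    """ROUGE-L F-measure in [0, 1]; beta=1.0 is an assumed balanced weighting."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


# Hypothetical usage: character-level tokens for a Japanese answer pair.
# Character tokenization is an assumption, not the report's stated setup.
prediction = list("赤い車が停まっている")
reference = list("赤い車が道路に停まっている")
score = rouge_l(prediction, reference)
```

A score on the 0 to 100 scale, as in the reported 61.5, would presumably correspond to this value multiplied by 100 and averaged over the benchmark's question-answer pairs.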
