VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

Jiaxin Fan; Wenpo Song

arXiv:2603.04957·cs.CV·March 6, 2026

Open Access

TL;DR

VisionPangu is a compact 1.7B-parameter multimodal model that improves detailed image captioning by combining efficient multimodal alignment, high-quality supervision, and instruction tuning. It achieves competitive performance while producing more structured descriptions.

Contribution

The paper introduces VisionPangu, a compact multimodal model that improves detailed image captioning through efficient multimodal alignment and high-quality supervision rather than through large-scale architectures.

Findings

01. Achieves competitive captioning performance with only 1.7B parameters.

02. Produces more structured and detailed image captions.

03. Effectively incorporates dense human-authored descriptions for semantic richness.

Abstract

Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve…
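
The abstract describes a vision encoder connected to a language backbone through a lightweight MLP projector, in the style of LLaVA. The sketch below illustrates that general pattern only; it is not the authors' implementation. The two-layer GELU design, the toy dimensions, the random weights, and every function name here are assumptions made for illustration, and placing the projected vision tokens ahead of the text embeddings is likewise an assumed simplification.

```python
import math
import random


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(in_dim, out_dim, rng):
    # Hypothetical random initialisation; returns (weights, bias).
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    # Dense layer applied to one vector: y = W x + b.
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def project_vision_tokens(vision_tokens, fc1, fc2):
    # Assumed two-layer MLP, applied independently to each vision token,
    # mapping the vision feature width to the language model's embedding width.
    return [linear([gelu(h) for h in linear(tok, fc1)], fc2) for tok in vision_tokens]


def build_llm_inputs(projected_vision, text_embeddings):
    # Assumed layout: projected vision tokens followed by text embeddings.
    return projected_vision + text_embeddings


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16      # toy sizes, not taken from the paper
NUM_PATCHES, NUM_TEXT = 4, 3     # toy token counts

fc1 = make_linear(VISION_DIM, LLM_DIM, rng)
fc2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for the vision encoder output and the text token embeddings.
vision_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT)]

projected = project_vision_tokens(vision_tokens, fc1, fc2)
sequence = build_llm_inputs(projected, text_embeddings)
print(len(sequence), len(sequence[0]))  # 7 16
```

The point of the sketch is that the projector is the only component that changes the feature width: after projection, image patches and text tokens share one embedding space and can be fed to the language backbone as a single sequence.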

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling