Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning
Junyu Lu, Dixiang Zhang, Xiaojun Wu, Xinyu Gao, Ruyi Gan, Jiaxing Zhang, Yan Song, Pingjian Zhang

TL;DR
Ziya-Visual introduces bilingual large vision-language models that leverage multi-task instruction tuning and multi-modal training to enhance image-text understanding and generation in both English and Chinese.
Contribution
The paper presents Ziya-Visual, a novel bilingual LVLM with multi-task instruction tuning, multi-stage training, and adaptation modules, enabling effective multi-modal dialogue in English and Chinese.
Findings
Achieves competitive performance on English image-text tasks.
Demonstrates effective Chinese multi-modal understanding and generation.
Outperforms existing LVLMs on various benchmarks.
Abstract
Recent advancements enlarge the capabilities of large language models (LLMs) in zero-shot image-to-text generation and understanding by integrating multi-modal inputs. However, such success is typically limited to English scenarios due to the lack of large-scale and high-quality non-English multi-modal resources, making it extremely difficult to establish competitive counterparts in other languages. In this paper, we introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs) designed to incorporate visual semantics into LLM for multi-modal dialogue. Composed of Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying Transformer from BLIP-2, further exploring the assistance of optimization schemes such as instruction tuning, multi-stage training and low-rank adaptation module for visual-language alignment. In addition, we stimulate the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Linear Layer · Label Smoothing · Residual Connection · Adam · Absolute Position Encodings · Layer Normalization
