Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

Junyu Lu; Dixiang Zhang; Xiaojun Wu; Xinyu Gao; Ruyi Gan; Jiaxing Zhang; Yan Song; Pingjian Zhang

arXiv:2310.08166·cs.CL·November 1, 2023·1 cites

TL;DR

Ziya-Visual introduces bilingual large vision-language models that leverage multi-task instruction tuning and multi-modal training to enhance image-text understanding and generation in both English and Chinese.

Contribution

The paper presents Ziya-Visual, a novel bilingual LVLM with multi-task instruction tuning, multi-stage training, and adaptation modules, enabling effective multi-modal dialogue in English and Chinese.

Findings

1. Achieves competitive performance on English image-text tasks.
2. Demonstrates effective Chinese multi-modal understanding and generation.
3. Outperforms existing LVLMs on various benchmarks.

Abstract

Recent advancements enlarge the capabilities of large language models (LLMs) in zero-shot image-to-text generation and understanding by integrating multi-modal inputs. However, such success is typically limited to English scenarios due to the lack of large-scale and high-quality non-English multi-modal resources, making it extremely difficult to establish competitive counterparts in other languages. In this paper, we introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs) designed to incorporate visual semantics into LLMs for multi-modal dialogue. Composed of Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying Transformer from BLIP-2, further exploring the assistance of optimization schemes such as instruction tuning, multi-stage training, and a low-rank adaptation module for visual-language alignment. In addition, we stimulate the…
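
The abstract names a low-rank adaptation module as one of the optimization schemes. As a rough illustration of the general idea only (not the paper's implementation), the sketch below adds a trainable low-rank update to a frozen weight matrix, following the standard LoRA formulation. All sizes, names, and the pure-Python matrix helper are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Return x @ W + (alpha / r) * (x @ A) @ B.

    W (d_in x d_out) stays frozen; only A (d_in x r) and B (r x d_out)
    would be trained. r is the rank, read from the number of rows in B.
    """
    rank = len(lora_b)
    scale = alpha / rank
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[p + scale * q for p, q in zip(base_row, upd_row)]
            for base_row, upd_row in zip(base, update)]


# Hypothetical sizes chosen only for illustration.
d_in, d_out, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(d_in)]
b = [[0.0] * d_out for _ in range(rank)]  # zero init: no change at start
x = [[rng.gauss(0, 1) for _ in range(d_in)]]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
print(lora_forward(x, w, a, b, alpha) == matmul(x, w))
print("trainable:", rank * (d_in + d_out), "frozen:", d_in * d_out)
```

Initializing B to zero means training starts from the frozen model's behavior, and the number of trainable values grows with the rank rather than with the full matrix size.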

Code & Models

Models

Hugging Face: IDEA-CCNL/Ziya-Visual-14B-Chat (model · 13 downloads · 7 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Linear Layer · Label Smoothing · Residual Connection · Adam · Absolute Position Encodings · Layer Normalization