Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

Xiaojun Wu; Dixiang Zhang; Ruyi Gan; Junyu Lu; Ziwei Wu; Renliang Sun; Jiaxing Zhang; Pingjian Zhang; Yan Song

arXiv:2401.14688 · cs.CL · June 19, 2024 · 2 cites

TL;DR

Taiyi-Diffusion-XL is a bilingual Chinese-English text-to-image model that extends existing models with enhanced vocabulary, position encoding, and vision-language support, achieving superior bilingual image generation and retrieval performance.

Contribution

The paper introduces Taiyi-Diffusion-XL, a novel bilingual text-to-image model with expanded Chinese vocabulary and improved vision-language integration, advancing open-source multilingual multimodal research.

Findings

01

The bilingual CLIP model outperforms previous models in image-text retrieval.

02

Taiyi-Diffusion-XL surpasses prior models in bilingual image generation quality.

03

The open-sourced model supports further research in multilingual multimodal AI.

Abstract

Recent advances in text-to-image models have significantly enhanced image generation capabilities, yet open-source models with bilingual or Chinese-language support remain scarce. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model developed by extending the capabilities of CLIP and Stable-Diffusion-XL through bilingual continuous pre-training. This approach includes an efficient expansion of the vocabulary, integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an expansion of the absolute position encoding. Additionally, we enrich text prompts using a large vision-language model, which yields better image captions and images of higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results…
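The vocabulary and position-encoding expansion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy sizes, and the Gaussian initialisation (standard deviation 0.02) of the new rows are all assumptions made for illustration, since this page does not state how the new embeddings are initialised. The sketch only shows the general idea of keeping the pretrained rows intact and appending new ones for added tokens and added positions.

```python
import random


def expand_token_embeddings(embeddings, vocab, new_tokens, rng):
    """Append one embedding row per unseen token; existing rows are kept as-is.

    Hypothetical helper: the initialisation (Gaussian, std 0.02) is an assumption.
    """
    dim = len(embeddings[0])
    vocab = dict(vocab)                          # copy so the caller's vocab is untouched
    embeddings = [row[:] for row in embeddings]  # copy the pretrained rows
    for token in new_tokens:
        if token in vocab:                       # skip tokens that already have a row
            continue
        vocab[token] = len(embeddings)           # the new id is the next free row index
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return embeddings, vocab


def expand_position_embeddings(pos_emb, new_len, rng):
    """Keep the learned absolute positions and append new rows up to new_len."""
    if new_len < len(pos_emb):
        raise ValueError("new_len must not be shorter than the existing table")
    dim = len(pos_emb[0])
    extra = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(new_len - len(pos_emb))]
    return [row[:] for row in pos_emb] + extra


# Toy example: a 3-token vocabulary with 4-dimensional embeddings.
rng = random.Random(0)
vocab = {"a": 0, "cat": 1, "dog": 2}
emb = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
new_emb, new_vocab = expand_token_embeddings(emb, vocab, ["猫", "狗", "cat"], rng)

# Toy example: extend an 8-position table to 16 positions.
pos = [[float(i)] * 4 for i in range(8)]
new_pos = expand_position_embeddings(pos, 16, rng)
```

In this sketch the appended rows would then be trained during continued pre-training, while the original rows start from their pretrained values.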

Code & Models

Repositories

IDEA-CCNL/Taiyi-Diffusion-XL
PyTorch · Official

Models

IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B
model · 97 downloads · 61 likes

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training