Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support
Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song

TL;DR
Taiyi-Diffusion-XL is a bilingual Chinese-English text-to-image model that extends existing models with enhanced vocabulary, position encoding, and vision-language support, achieving superior bilingual image generation and retrieval performance.
Contribution
The paper introduces Taiyi-Diffusion-XL, a novel bilingual text-to-image model with expanded Chinese vocabulary and improved vision-language integration, advancing open-source multilingual multimodal research.
Findings
Bilingual CLIP model outperforms previous models in image-text retrieval.
Taiyi-Diffusion-XL surpasses prior models in bilingual image generation quality.
Open-sourced model fosters further research in multilingual multimodal AI.
Abstract
Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet a notable gap persists among open-source models in bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model developed by extending the capabilities of CLIP and Stable-Diffusion-XL through bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts using a large vision-language model, leading to better image captions and generated images of higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results…
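The vocabulary and position-encoding expansion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (extend_vocab, expand_table), the toy sizes, and the small random initialization of new rows are all assumptions chosen for illustration. The only idea taken from the abstract is that new Chinese tokens and extra positions are appended while the pretrained rows are kept, so that continued pre-training starts from the original weights.

```python
import random


def extend_vocab(vocab, new_tokens):
    """Return a copy of the token->id map with unseen tokens appended.

    Assumes existing ids are contiguous (0..len(vocab)-1), so the next
    free id is simply the current size of the map.
    """
    extended = dict(vocab)
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended


def expand_table(table, new_size, seed=0, scale=0.02):
    """Grow an embedding table (list of rows) to new_size rows.

    Existing rows are copied unchanged; added rows get a small Gaussian
    initialization. The init scheme is an assumption, not the paper's.
    """
    if new_size < len(table):
        raise ValueError("new_size must be >= current number of rows")
    rng = random.Random(seed)
    dim = len(table[0])
    extra = [
        [rng.gauss(0.0, scale) for _ in range(dim)]
        for _ in range(new_size - len(table))
    ]
    return [row[:] for row in table] + extra


# Toy stand-ins for a pretrained text encoder (hypothetical sizes).
vocab = {"<bos>": 0, "<eos>": 1, "cat": 2}
token_emb = [[0.1 * (i + j) for j in range(4)] for i in range(len(vocab))]
pos_emb = [[0.01 * (i - j) for j in range(4)] for i in range(5)]

# Add frequent Chinese characters; "cat" already exists and is skipped.
new_vocab = extend_vocab(vocab, ["猫", "狗", "cat"])
new_token_emb = expand_table(token_emb, len(new_vocab))

# Lengthen the absolute position table from 5 to 8 positions.
new_pos_emb = expand_table(pos_emb, 8, seed=1)

print("vocab size:", len(vocab), "->", len(new_vocab))
print("positions:", len(pos_emb), "->", len(new_pos_emb))
```

In a real setup the same two steps would apply to the tokenizer vocabulary and to the text encoder's token and position embedding matrices before bilingual training resumes; the lists here only stand in for those tensors.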
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
