Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
Jinyi Hu, Xu Han, Xiaoyuan Yi, Yutong Chen, Wenhao Li, Zhiyuan Liu, Maosong Sun

TL;DR
This paper introduces IAP, a novel method that efficiently transfers English Stable Diffusion to Chinese by aligning Chinese semantics with English in CLIP space using images as pivots, requiring minimal training data.
Contribution
IAP is a simple, effective approach that leverages images as pivots to align Chinese and English semantics in CLIP, enabling cross-lingual diffusion without extensive retraining.
Findings
Outperforms strong Chinese diffusion models with only 5-10% of the training data
Establishes efficient connections between Chinese, English, and visual semantics in CLIP
Improves image generation quality with direct Chinese prompts
Abstract
Diffusion models have made impressive progress in text-to-image synthesis. However, training such large-scale models (e.g., Stable Diffusion) from scratch requires high computational costs and massive numbers of high-quality text-image pairs, which becomes unaffordable in other languages. To handle this challenge, we propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese. IAP optimizes only a separate Chinese text encoder, with all other parameters fixed, to align the Chinese semantic space with the English one in CLIP. To achieve this, we treat images as pivots and minimize the distance between the attentive features produced by cross-attention between images and each language, respectively. In this way, IAP efficiently establishes connections among Chinese, English, and visual semantics in CLIP's embedding space, advancing the quality of the generated image with…
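The pivot idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of image features as attention queries, the absence of learned projection matrices, and the mean-squared distance are all assumptions made for illustration. The abstract only states that attentive features from cross-attention between images and each language are pulled together.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_feats):
    """Image patches (queries) attend over text tokens (keys and values).

    Assumption: no learned projections; a real model would likely have them.
    """
    d = image_feats.shape[-1]
    scores = image_feats @ text_feats.T / np.sqrt(d)  # (patches, tokens)
    return softmax(scores, axis=-1) @ text_feats      # (patches, d)

def pivot_alignment_loss(image_feats, en_feats, zh_feats):
    """Hypothetical loss: distance between image-conditioned attentive features.

    The mean-squared distance is an assumption; the abstract does not
    specify the distance measure.
    """
    att_en = cross_attention(image_feats, en_feats)
    att_zh = cross_attention(image_feats, zh_feats)
    return float(np.mean((att_en - att_zh) ** 2))

# Toy features standing in for encoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))  # 4 image patches, dimension 8
en_feats = rng.normal(size=(5, 8))     # 5 English tokens
zh_feats = rng.normal(size=(6, 8))     # 6 Chinese tokens

loss_random = pivot_alignment_loss(image_feats, en_feats, zh_feats)
loss_identical = pivot_alignment_loss(image_feats, en_feats, en_feats)
```

In a training setup matching the abstract, only the Chinese text encoder that produces `zh_feats` would receive gradients from this loss, while the image and English encoders stay frozen. The two captions may have different token counts because both attentive outputs share the image's shape.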
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training · Diffusion · ALIGN
