Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment