Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval