Unifying Vision-Language Representation Space with Single-tower Transformer