Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers