Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR