Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference