Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions