Loading paper
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models | Tomesphere