Joint learning of images and videos with a single Vision Transformer