Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment