Video-Text Representation Learning via Differentiable Weak Temporal Alignment