Pretrained Image-Text Models are Secretly Video Captioners