VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning