See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding