MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding