STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding