Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding