UrbanClipAtlas: A Visual Analytics Framework for Event and Scene Retrieval in Urban Videos
Joel Perca, Luis Sante, Juanpablo Heredia, Joao Rulff, Claudio Silva, Jorge Poco

TL;DR
URBANCLIPATLAS is a visual analytics system that enables efficient event and scene retrieval in long urban videos by combining video segmentation, textual description generation, and knowledge graph-based grounding.
Contribution
It introduces a novel integrated framework that combines retrieval-augmented generation, taxonomy-aware entity extraction, and visual grounding for urban video analysis.
Findings
Supports scene retrieval via a chat-based interface.
Reduces the effort of validating model outputs.
Effectiveness demonstrated on the StreetAware dataset.
Abstract
Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval…
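The segment, describe, and index steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Clip`, `segment`, and `retrieve` are hypothetical, the clip descriptions are hand-written stand-ins for vision-language model output, and a bag-of-words cosine score is swapped in for whatever semantic index the system actually uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical record for one short clip cut from a long recording."""
    start_s: float
    end_s: float
    description: str = ""  # would come from a vision-language model


def segment(duration_s, clip_len_s):
    """Split [0, duration_s) into consecutive clips of at most clip_len_s seconds."""
    clips = []
    start = 0.0
    while start < duration_s:
        clips.append(Clip(start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return clips


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(clips, query, k=1):
    """Return up to k clips whose descriptions best match the query.

    Assumption: bag-of-words cosine stands in for embedding-based retrieval.
    Clips with no token overlap are dropped.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c.description))), c) for c in clips]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


# Toy data: a 25-second recording cut into 10-second clips.
clips = segment(25, 10)
clips[0].description = "a bus waits at the red light"
clips[1].description = "a cyclist crosses the intersection"
clips[2].description = "pedestrians walk on the sidewalk"

hits = retrieve(clips, "cyclist intersection")
```

In this toy setup, `hits` holds the clips a chat-style query would surface, each carrying the time range needed to jump back to the footage for verification.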
