ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection
Maryam Hosseini, Marco Cipriano, Sedigheh Eslami, Daniel Hodczak, Liu Liu, Andres Sevtsuk, Gerard de Melo

TL;DR
ELSA introduces a comprehensive benchmark for evaluating social activity localization in urban street images, highlighting the limitations of current open-vocabulary detection models in multi-label, multi-activity scenarios.
Contribution
This work provides the first multi-label social activity detection benchmark with over 900 annotated images and introduces novel confidence and aggregation methods for better evaluation.
Findings
Current models struggle with semantic consistency in social activity detection.
Existing models often produce overconfident predictions that lack contextual accuracy.
ELSA reveals significant limitations in state-of-the-art models for multi-label social activity localization.
Abstract
Existing Open Vocabulary Detection (OVD) models exhibit a number of challenges. They often struggle with semantic consistency across diverse inputs, and are often sensitive to slight variations in input phrasing, leading to inconsistent performance. The calibration of their predictive confidence, especially in complex multi-label scenarios, remains suboptimal, frequently resulting in overconfident predictions that do not accurately reflect their context understanding. To understand these limitations, multi-label detection benchmarks are needed. A particularly challenging domain for such benchmarking is social activities. Due to the lack of multi-label benchmarks for social interactions, in this work we present ELSA: Evaluating Localization of Social Activities. ELSA draws on theoretical frameworks in urban sociology and design and uses in-the-wild street-level imagery, where the size of…
Taxonomy
Topics: Environmental and Ecological Studies · Latin American Urban Studies
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · self-DIstillation with NO labels · MDETR
