Dynamic Scene Understanding from Vision-Language Representations
Shahaf Pruss, Morris Alper, Hadar Averbuch-Elor

TL;DR
This paper introduces a framework that leverages modern vision-language representations to understand complex, dynamic scenes, achieving state-of-the-art results with minimal task-specific engineering and trainable parameters.
Contribution
It demonstrates how frozen vision-language models can be used for dynamic scene understanding tasks by framing them as structured text prediction or representation concatenation, reducing the need for extensive training.
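The two framings named above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `key: value | key: value` output format, the function names `parse_frame` and `concat_features`, and the use of plain Python lists in place of real feature tensors are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def parse_frame(text: str) -> Dict[str, str]:
    """Parse generated text such as 'verb: riding | agent: man' into a dict.

    The 'key: value | key: value' format is an assumed, hypothetical
    output format. Fields without a colon or with an empty key are skipped.
    """
    frame: Dict[str, str] = {}
    for field in text.split("|"):
        key, sep, value = field.partition(":")
        if sep and key.strip():
            frame[key.strip().lower()] = value.strip()
    return frame


def concat_features(existing: List[float], frozen: List[float]) -> List[float]:
    """Append frozen V&L features to an existing model's input features.

    Plain lists stand in for feature vectors; a real system would
    concatenate tensors along the feature dimension.
    """
    return list(existing) + list(frozen)


if __name__ == "__main__":
    print(parse_frame("verb: riding | agent: man | vehicle: horse"))
    print(concat_features([0.1, 0.2], [0.3]))
```

In this sketch the first function corresponds to the "structured text prediction" framing, where a model's text output is parsed into a task-specific structure, and the second to the "representation concatenation" framing, where frozen features are added to the input of an existing model.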
Findings
State-of-the-art performance on dynamic scene understanding tasks
Modern V&L representations encode dynamic scene semantics effectively
Minimal trainable parameters needed for high performance
Abstract
Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal…
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Semantic Web and Ontologies
