Dynamic Scene Understanding from Vision-Language Representations

Shahaf Pruss; Morris Alper; Hadar Averbuch-Elor

arXiv:2501.11653 · cs.CV · May 6, 2025

Open Access

TL;DR

This paper introduces a framework that uses modern, frozen vision-language representations to understand complex, dynamic scenes, achieving state-of-the-art results with minimal task-specific engineering and few trainable parameters.

Contribution

The paper demonstrates how frozen vision-language models can be used for dynamic scene understanding tasks by framing them either as structured text prediction or as concatenation of representations to the input of existing models, reducing the need for extensive training.
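
The representation-concatenation framing can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name `concat_frozen_features`, the use of plain Python lists as feature vectors, and the toy dimensions are all hypothetical.

```python
from typing import List, Sequence


def concat_frozen_features(
    task_features: Sequence[float],
    frozen_vl_features: Sequence[float],
) -> List[float]:
    """Append a frozen V&L embedding to the features an existing model already consumes.

    Hypothetical helper: the frozen embedding is treated as a fixed input,
    so nothing in this step is trained.
    """
    return list(task_features) + list(frozen_vl_features)


# Toy example: a 2-dimensional task feature vector extended with a
# 3-dimensional frozen V&L embedding (values are made up).
augmented = concat_frozen_features([0.1, 0.2], [0.7, 0.8, 0.9])
```

In this sketch the existing model would then receive the longer vector as its input; only the layers that read that input would need to account for the extra dimensions.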

Findings

01  State-of-the-art performance on dynamic scene understanding tasks

02  Modern V&L representations encode dynamic scene semantics effectively

03  Minimal trainable parameters needed for high performance

Abstract

Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (V&L) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen V&L representations. By framing these tasks in a generic manner (as predicting and parsing structured text, or by directly concatenating representations to the input of existing models), we achieve state-of-the-art results while using a minimal…
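
To make "predicting and parsing structured text" concrete, here is a minimal sketch of the parsing half. The `key: value; key: value` output format, the role names, and the function `parse_situation` are assumptions chosen for illustration; the abstract does not specify the text format the authors use.

```python
from typing import Dict


def parse_situation(text: str) -> Dict[str, str]:
    """Parse 'key: value; key: value' text into a frame dictionary.

    Hypothetical format: fields are separated by ';' and each field is a
    'role: filler' pair. Malformed or empty fields are skipped.
    """
    frame: Dict[str, str] = {}
    for field in text.split(";"):
        if ":" not in field:
            continue  # skip fields that are not 'role: filler' pairs
        key, value = field.split(":", 1)
        key, value = key.strip().lower(), value.strip()
        if key and value:
            frame[key] = value
    return frame


# Made-up model output for an image of a cyclist.
predicted_text = "verb: riding; agent: woman; vehicle: bicycle; place: street"
frame = parse_situation(predicted_text)
```

Under this assumed format, a model only has to emit a short string, and a small deterministic parser turns it into the structured prediction that the task's metrics expect.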

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Semantic Web and Ontologies