Visual Graphs from Motion (VGfM): Scene understanding with object geometry reasoning
Paul Gay, Stuart James, Alessio Del Bue

TL;DR
This paper introduces a novel system that leverages multi-view geometric relations from video sequences to generate 3D scene graphs, enhancing scene understanding by combining geometry and visual features within an RNN framework.
Contribution
It presents a new model that merges geometric and visual features using an RNN to construct 3D scene graphs from video sequences, addressing limitations of single-image scene understanding.
Findings
Effective 3D scene graph generation from multi-view videos
Improved scene understanding through geometric reasoning
New dataset for 3D scene graph tasks
Abstract
Recent approaches to visual scene understanding attempt to build a scene graph, a computational representation of objects and their pairwise relationships. Such a rich semantic representation is very appealing, yet difficult to obtain from a single image, especially when considering complex spatial arrangements in the scene. In contrast, an image sequence conveys useful information through the multi-view geometric relations arising from camera motion. Indeed, in such cases, object relationships are naturally related to the 3D scene structure. To this end, this paper proposes a system that first computes the geometrical location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometrical reasoning. This representation is obtained using a new model in which geometric and visual features are merged within an RNN framework. We…
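To make the idea of a scene graph concrete, here is a minimal sketch of one as a data structure: objects as nodes, labelled pairwise relationships as edges, and a simple geometric cue (the 3D offset between two objects) of the kind that could be combined with visual features. This is an illustration only, not the authors' implementation; the class names, the `center` field, and the `relative_offset` helper are all hypothetical, and the coordinates are made-up values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneObject:
    """A node in the graph. `center` is a hypothetical 3D position."""
    name: str
    center: Vec3


@dataclass
class SceneGraph:
    """Objects plus (subject, predicate, object) relationship triples."""
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.name] = obj

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Both endpoints must already be nodes in the graph.
        for name in (subject, obj):
            if name not in self.objects:
                raise KeyError(f"unknown object: {name}")
        self.relations.append((subject, predicate, obj))

    def relative_offset(self, subject: str, obj: str) -> Vec3:
        """3D offset from `obj` to `subject` (an assumed geometric feature)."""
        s = self.objects[subject].center
        o = self.objects[obj].center
        return (s[0] - o[0], s[1] - o[1], s[2] - o[2])


# Illustrative values only: a cup resting 0.75 units above a table centre.
graph = SceneGraph()
graph.add_object(SceneObject("table", (0.5, 0.0, 0.0)))
graph.add_object(SceneObject("cup", (0.5, 0.0, 0.75)))
graph.add_relation("cup", "on", "table")
offset = graph.relative_offset("cup", "table")
```

In this sketch, the offset between the two object centres is the sort of cue that is hard to recover from a single image but follows directly once object locations in 3D are known.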
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
