Learning Structured Representations of Visual Scenes

Meng-Jiun Chiou

arXiv:2207.04200 · cs.CV · July 12, 2022

Open Access

TL;DR

This paper explores methods for constructing and learning structured visual scene representations, such as visual relationships, in images and videos, aiming to improve interpretability and reasoning capabilities.

Contribution

It introduces new approaches for learning structured scene representations in static images and videos, incorporating external knowledge and bias reduction techniques.

Findings

01 Improved structured representation learning methods

02 Enhanced interpretability of visual scene models

03 Discussion of open challenges and future directions

Abstract

As the intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairs of objects, have been shown not only to help compositional models learn to reason along the structures but also to make model decisions more interpretable. Nevertheless, these representations receive much less attention than traditional recognition tasks, leaving numerous open challenges unsolved. In this thesis, we study how machines can describe the content of an individual image or video using visual relationships as the structured representations. Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both the static-image and video settings, with improvements resulting from external knowledge incorporation, bias-reducing mechanisms, and enhanced…
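To make the idea of "visual relationships between pairwise objects" concrete, the sketch below shows one common way such a structured representation could be stored: as subject-predicate-object triplets grouped into a small scene graph. This is an illustrative assumption, not the thesis's implementation; the `Relationship` class, the `build_scene_graph` helper, and the example labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One visual relationship between a pair of objects (hypothetical type)."""
    subject: str
    predicate: str
    object: str


def build_scene_graph(relationships):
    """Group triplets by subject: {subject: [(predicate, object), ...]}."""
    graph = defaultdict(list)
    for rel in relationships:
        graph[rel.subject].append((rel.predicate, rel.object))
    return dict(graph)


# Made-up relationships for a single image, for illustration only.
relationships = [
    Relationship("person", "riding", "horse"),
    Relationship("person", "wearing", "hat"),
    Relationship("horse", "standing on", "grass"),
]

scene_graph = build_scene_graph(relationships)
for subject, edges in scene_graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

A graph like this is what makes the representation "structured": a downstream model or a human can inspect each edge individually, which is the interpretability benefit the abstract refers to.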

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning