Learning Structured Representations of Visual Scenes
Meng-Jiun Chiou

TL;DR
This thesis explores methods for constructing and learning structured representations of visual scenes, such as visual relationships, in images and videos, with the aim of improving interpretability and reasoning capabilities.
Contribution
It introduces new approaches for learning structured scene representations in static images and videos, incorporating external knowledge and bias reduction techniques.
Findings
Improved structured representation learning methods
Enhanced interpretability of visual scene models
Discussion of open challenges and future directions
Abstract
As the intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairwise objects, have been shown not only to benefit compositional models in learning to reason along with the structures but also to provide higher interpretability for model decisions. Nevertheless, these representations receive much less attention than traditional recognition tasks, leaving numerous open challenges unsolved. In this thesis, we study how machines can describe the content of an individual image or video with visual relationships as the structured representations. Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both the static-image and video settings, with improvements resulting from external knowledge incorporation, a bias-reducing mechanism, and enhanced…
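To make the idea of visual relationships between pairwise objects concrete, here is a minimal sketch of how such relationships are commonly written as (subject, predicate, object) triplets and grouped into a simple graph. The class name `Relationship`, the helper `build_scene_graph`, and the example triplets are hypothetical illustrations and are not taken from the thesis.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One visual relationship between a pair of objects (illustrative only)."""
    subject: str
    predicate: str
    obj: str


def build_scene_graph(relationships):
    """Group relationships by subject into an adjacency-list style graph."""
    graph = defaultdict(list)
    for rel in relationships:
        graph[rel.subject].append((rel.predicate, rel.obj))
    return dict(graph)


# Hypothetical relationships describing a single image.
relationships = [
    Relationship("person", "riding", "horse"),
    Relationship("person", "wearing", "hat"),
    Relationship("horse", "on", "grass"),
]

scene_graph = build_scene_graph(relationships)
for subject, edges in scene_graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

Each subject maps to its outgoing (predicate, object) edges, which is one simple way to inspect which relationships a model has attached to each object.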
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
