Learning to Compose Dynamic Tree Structures for Visual Contexts
Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu

TL;DR
This paper introduces VCTree, a dynamic binary tree model for visual context reasoning that adapts to each image and task, improving performance in scene graph generation and visual Q&A.
Contribution
The paper proposes a novel dynamic tree structure, VCTree, with a task-dependent scoring function and a hybrid learning method combining supervised and reinforcement learning.
Findings
VCTree outperforms state-of-the-art methods on Visual Genome and VQA2.0 benchmarks.
The model discovers interpretable visual context structures.
Dynamic trees improve reasoning over static graph representations.
Abstract
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key advantages over existing structured object representations including chains and fully-connected graphs: 1) The efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" usually co-occur and belong to "person"; 2) the dynamic structure varies from image to image and task to task, allowing more content-/task-specific message passing among objects. To construct a VCTree, we design a score function that calculates the task-dependent validity between each object pair, and the tree is the binary version of the maximum spanning tree from the score matrix. Then, visual contexts…
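The tree construction described in the abstract can be illustrated with a small sketch. The abstract says only that the tree is "the binary version of the maximum spanning tree from the score matrix", so everything else here is an assumption made for illustration: the spanning tree is grown with Prim's algorithm, the root is taken to be the object with the largest total score, and the multi-branch tree is made binary with a left-child/right-sibling conversion. The function names and the toy score matrix are hypothetical, and the learned score function is replaced by hand-written numbers.

```python
from typing import List, Optional, Tuple


def maximum_spanning_tree(scores: List[List[float]]) -> Tuple[int, List[List[int]]]:
    """Grow a maximum spanning tree over a symmetric score matrix (Prim's algorithm).

    Returns the root index and, for every node, its children in the order
    they were attached.
    """
    n = len(scores)
    # Assumption: the root is the object with the largest total pairwise score.
    root = max(range(n), key=lambda i: sum(scores[i][j] for j in range(n) if j != i))
    in_tree = [root]                      # nodes already attached, in insertion order
    attached = [i == root for i in range(n)]
    children: List[List[int]] = [[] for _ in range(n)]
    while len(in_tree) < n:
        # Pick the highest-scoring edge that connects the tree to a new node.
        _, parent, child = max(
            ((scores[p][c], p, c) for p in in_tree for c in range(n) if not attached[c]),
            key=lambda edge: edge[0],
        )
        children[parent].append(child)
        attached[child] = True
        in_tree.append(child)
    return root, children


def to_binary(children: List[List[int]]) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Left-child/right-sibling conversion of a multi-branch tree.

    left[i] is node i's first child; right[i] is node i's next sibling.
    """
    n = len(children)
    left: List[Optional[int]] = [None] * n
    right: List[Optional[int]] = [None] * n
    for parent, kids in enumerate(children):
        if kids:
            left[parent] = kids[0]
            for sibling, next_sibling in zip(kids, kids[1:]):
                right[sibling] = next_sibling
    return left, right


# Toy example: 0 = person, 1 = clothes, 2 = pants, 3 = tree (hand-written scores).
scores = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.3, 0.1],
    [0.8, 0.3, 0.0, 0.2],
    [0.1, 0.1, 0.2, 0.0],
]
root, children = maximum_spanning_tree(scores)
left, right = to_binary(children)
```

In this toy case "person" becomes the root, "clothes" becomes its left child, and "pants" is reached as the right (sibling) link of "clothes", which mirrors the parallel/hierarchical reading the abstract gives for these objects.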
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
