Learning to Compose Dynamic Tree Structures for Visual Contexts

Kaihua Tang; Hanwang Zhang; Baoyuan Wu; Wenhan Luo; Wei Liu

arXiv:1812.01880 · cs.CV · December 6, 2018 · 31 cites

TL;DR

This paper introduces VCTree, a dynamic binary tree model for visual context reasoning that adapts to each image and task, improving performance in scene graph generation and visual Q&A.

Contribution

The paper proposes a novel dynamic tree structure, VCTree, with a task-dependent scoring function and a hybrid learning method combining supervised and reinforcement learning.

Findings

01. VCTree outperforms state-of-the-art methods on the Visual Genome (scene graph generation) and VQA2.0 (visual Q&A) benchmarks.
02. The model discovers interpretable visual context structures.
03. Dynamic trees improve reasoning over static structures such as chains and fully-connected graphs.

Abstract

We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A. Our visual context tree model, dubbed VCTree, has two key advantages over existing structured object representations, including chains and fully-connected graphs: 1) the efficient and expressive binary tree encodes the inherent parallel/hierarchical relationships among objects, e.g., "clothes" and "pants" usually co-occur and belong to "person"; 2) the dynamic structure varies from image to image and task to task, allowing more content-/task-specific message passing among objects. To construct a VCTree, we design a score function that calculates the task-dependent validity between each object pair, and the tree is the binary version of the maximum spanning tree from the score matrix. Then, visual contexts…
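The construction step in the abstract (score matrix, then maximum spanning tree, then binary tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the score values are made up and stand in for the learned, task-dependent score function; picking the root as the object with the largest total score and binarizing with a left-child/right-sibling conversion are our assumptions; all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pairwise scores for four objects. In VCTree these would come
# from a learned, task-dependent score function, which is not modeled here.
LABELS = ["person", "shirt", "pants", "shoes"]
SCORES = [
    [0.0, 0.9, 0.8, 0.3],
    [0.9, 0.0, 0.6, 0.1],
    [0.8, 0.6, 0.0, 0.4],
    [0.3, 0.1, 0.4, 0.0],
]


def pick_root(scores: List[List[float]]) -> int:
    """Assumption: root = object with the largest total score to all others."""
    n = len(scores)
    return max(range(n), key=lambda i: sum(scores[i][j] for j in range(n) if j != i))


def max_spanning_tree(scores: List[List[float]], root: int) -> Dict[int, List[int]]:
    """Prim's algorithm maximizing edge scores; returns a children map rooted at `root`."""
    n = len(scores)
    in_tree = [False] * n
    best = [float("-inf")] * n  # strongest known link from each node into the tree
    parent: List[Optional[int]] = [None] * n
    best[root] = 0.0  # all other entries are -inf, so the root is selected first
    children: Dict[int, List[int]] = {i: [] for i in range(n)}
    for _ in range(n):
        # Add the out-of-tree node with the strongest link to the current tree.
        u = max((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and scores[u][v] > best[v]:
                best[v] = scores[u][v]
                parent[v] = u
    return children


def to_binary(children: Dict[int, List[int]], root: int) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Left-child/right-sibling conversion: left = first child, right = next sibling."""
    left: Dict[int, int] = {}
    right: Dict[int, int] = {}

    def visit(u: int) -> None:
        kids = children[u]
        if kids:
            left[u] = kids[0]
            for a, b in zip(kids, kids[1:]):
                right[a] = b
            for k in kids:
                visit(k)

    visit(root)
    return left, right


if __name__ == "__main__":
    root = pick_root(SCORES)
    children = max_spanning_tree(SCORES, root)
    left, right = to_binary(children, root)
    print("root:", LABELS[root])
    print("left (child):", {LABELS[k]: LABELS[v] for k, v in left.items()})
    print("right (sibling):", {LABELS[k]: LABELS[v] for k, v in right.items()})
```

With these made-up scores, "person" becomes the root, "shirt" and "pants" attach to it, and "shoes" attaches to "pants". In the binary form, "pants" becomes the right child of "shirt". In this sketch, left branches therefore carry hierarchical (parent/child) links and right branches carry parallel (sibling) links.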

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning