Learning to Assemble Neural Module Tree Networks for Visual Grounding
Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha

TL;DR
This paper introduces NMTree, a modular neural network that grounds natural language in images by following sentence dependency trees, enabling explainable and compositional visual reasoning.
Contribution
It proposes a novel neural module tree network that aligns visual grounding with sentence dependency parsing, improving interpretability and performance.
Findings
Outperforms state-of-the-art methods on multiple benchmarks.
Provides explainable visual grounding through detailed module attention.
Effectively handles parsing errors with end-to-end training.
Abstract
Visual grounding, a task to ground (i.e., localize) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction where needed. NMTree disentangles the visual grounding from the composite reasoning, allowing the former to only focus on…
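The bottom-up accumulation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `node_scores` dot-product scorer, and the toy feature vectors are all assumptions made for illustration, standing in for the learned neural modules and real region features. It only shows the general idea of scoring image regions at each tree node and summing child scores into the parent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One dependency-tree node (hypothetical structure, for illustration only)."""
    word: str
    word_vec: List[float]  # toy stand-in for the node's linguistic feature
    children: List["Node"] = field(default_factory=list)


def node_scores(node: Node, regions: List[List[float]]) -> List[float]:
    # Toy stand-in for a neural module: score each image region by the
    # dot product between the word feature and the region feature.
    return [sum(w * r for w, r in zip(node.word_vec, region)) for region in regions]


def accumulate(node: Node, regions: List[List[float]]) -> List[float]:
    # Post-order traversal: each child's accumulated scores are added
    # to the scores computed at the current node.
    total = node_scores(node, regions)
    for child in node.children:
        child_total = accumulate(child, regions)
        total = [t + c for t, c in zip(total, child_total)]
    return total


def ground(root: Node, regions: List[List[float]]) -> int:
    # Return the index of the region with the highest accumulated score.
    scores = accumulate(root, regions)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: "black cat", where "black" modifies the root "cat".
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # assumed region features
root = Node("cat", [1.0, 0.0], children=[Node("black", [0.0, 1.0])])
best_region = ground(root, regions)
```

In this toy setup the third region matches both words, so it receives the highest accumulated score. A plain sum is used here only for simplicity; the paper's modules and their combination rule are learned, and the abstract does not spell them out.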
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
