Dynamic Attention Networks for Task Oriented Grounding
Soumik Dasgupta, Badri N. Patro, and Vinay P. Namboodiri

TL;DR
This paper introduces a Dynamic Attention Network for task-oriented grounding in visual environments, improving multi-modal fusion of text and visual data for robotic control without prior domain knowledge.
Contribution
The work presents a novel end-to-end trainable architecture that models dynamic attention in continuous 3D visual worlds, enhancing grounding and policy learning for robots.
Findings
Dynamic Attention improves grounding accuracy.
Using 1D convolution accelerates network convergence.
LSTM cell-state effectively models dynamic attention.
Abstract
In order to successfully perform tasks specified by natural language instructions, an artificial agent operating in a visual world needs to map words, concepts, and actions from the instruction to visual elements in its environment. This association is termed Task-Oriented Grounding. In this work, we propose a novel Dynamic Attention Network architecture for the efficient multi-modal fusion of text and visual representations, which can generate a robust definition of state for the policy learner. Our model assumes no prior knowledge from the visual and textual domains and is end-to-end trainable. For a 3D visual world where the observation changes continuously, the attention on the visual elements tends to be highly correlated from one time step to the next. We term this "Dynamic Attention". In this work, we show that Dynamic Attention helps in achieving grounding and also aids in…
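The idea of attention that carries over from one time step to the next can be illustrated with a small sketch. This is not the authors' architecture: the dot-product scoring, the `AttentionLSTM` class, the `dynamic_attention` function, and the choice to read the attention map from a softmax over the LSTM cell state are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class AttentionLSTM:
    """Hypothetical LSTM cell whose state has one unit per visual location."""

    def __init__(self, n_locations, seed=0):
        rng = np.random.default_rng(seed)
        self.n = n_locations
        # A single matrix holds the input, forget, output, and candidate gates.
        self.W = rng.normal(0.0, 0.1, size=(4 * self.n, 2 * self.n))
        self.b = np.zeros(4 * self.n)

    def step(self, logits, h, c):
        z = self.W @ np.concatenate([logits, h]) + self.b
        i, f, o, g = np.split(z, 4)
        # The cell state blends the previous state with the new evidence.
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new


def dynamic_attention(frames, instruction, cell):
    """Return one attention map per frame, shape (n_frames, n_locations).

    frames: list of arrays of shape (n_locations, dim), visual features.
    instruction: array of shape (dim,), an embedded instruction (assumed given).
    """
    h = np.zeros(cell.n)
    c = np.zeros(cell.n)
    maps = []
    for V in frames:
        # Text-conditioned score per location (assumed dot-product fusion).
        logits = V @ instruction / np.sqrt(V.shape[1])
        h, c = cell.step(logits, h, c)
        # Assumption: the attention map is read from the cell state.
        maps.append(softmax(c))
    return np.stack(maps)
```

Because the cell state at each step depends on the previous cell state, the attention map for a frame is tied to the map for the preceding frame, which is the property the abstract calls Dynamic Attention. A trained model would learn the gate weights jointly with the policy.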
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: Convolution
