Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding
Heng Zhao, Joey Tianyi Zhou, Yew-Soon Ong

TL;DR
Word2Pix introduces a transformer-based one-stage visual grounding method that treats each word equally, enabling more precise language-to-visual attention and outperforming existing models on standard datasets.
Contribution
The paper proposes a novel encoder-decoder transformer architecture that attends to visual pixels for each word independently, improving grounding accuracy over prior holistic sentence embedding methods.
Findings
Outperforms existing one-stage methods on RefCOCO, RefCOCO+, and RefCOCOg datasets.
Surpasses two-stage models in accuracy while maintaining end-to-end training and real-time inference.
Demonstrates the effectiveness of word-to-pixel attention in visual grounding tasks.
Abstract
Current one-stage methods for visual grounding encode the language query as one holistic sentence embedding before fusing it with visual features. Such a formulation does not treat each word of a query sentence on par when modeling language-to-visual attention, and is therefore prone to neglecting words that are less important for the sentence embedding but critical for visual grounding. In this paper we propose Word2Pix: a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. The embedding of each word in the query sentence is treated alike, attending to visual pixels individually instead of through a single holistic sentence embedding. In this way, each word is given an equal opportunity to adjust the language-to-vision attention towards the referent target through multiple stacks…
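The word-to-pixel attention described in the abstract can be illustrated with a minimal sketch. This is generic single-head scaled dot-product cross attention, not the authors' implementation: each word embedding acts as a query over the flattened pixel features, so every word gets its own attention map over the image. The function name, the toy dimensions, and the absence of learned projections, multiple heads, and positional encodings are all simplifying assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_to_pixel_attention(word_embs, pixel_feats):
    """Hypothetical helper: single-head scaled dot-product cross attention.

    word_embs:   list of N word vectors (queries), each of length d
    pixel_feats: list of H*W pixel vectors (keys and values), each of length d
    Returns (attended, weights): one attended visual vector and one
    attention map over the pixels for each word.
    """
    d = len(word_embs[0])
    attended, weights = [], []
    for query in word_embs:
        # Similarity between this word and every pixel, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in pixel_feats
        ]
        attn = softmax(scores)
        # Weighted sum of pixel features: the visual context for this word.
        context = [
            sum(a * value[j] for a, value in zip(attn, pixel_feats))
            for j in range(d)
        ]
        weights.append(attn)
        attended.append(context)
    return attended, weights


# Toy inputs (assumed sizes): 4 words, a 2x3 feature map flattened to 6 pixels.
random.seed(0)
d, num_words, num_pixels = 8, 4, 6
words = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_words)]
pixels = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_pixels)]

attended, weights = word_to_pixel_attention(words, pixels)
```

Here `weights[i]` is the attention map of word `i` over the six pixels. A holistic sentence embedding would instead yield a single map for the whole query, which is the limitation the abstract points to.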
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
