Guided Visual Attention Model Based on Interactions Between Top-down and Bottom-up Information for Robot Pose Prediction
Hyogo Hiruma, Hiroki Mori, Hiroshi Ito, Tetsuya Ogata

TL;DR
This paper introduces a novel visual attention model for robot pose prediction that can switch attention targets after training when its Query representations are modified externally, improving performance and data efficiency in both simulated and real-world environments.
Contribution
A new Key-Query-Value based attention model enabling dynamic attention target modification for robot vision tasks.
Findings
Achieves higher accuracy than existing models in simulation.
Demonstrates high precision and scalability in real-world robot experiments.
Shows improved data efficiency over traditional models.
Abstract
Deep robot vision models are widely used for recognizing objects from camera images, but show poor performance when detecting objects at untrained positions. Although this problem can be alleviated by training with large datasets, the cost of dataset collection cannot be ignored. Existing visual attention models tackle the problem by employing a data-efficient structure that learns to extract task-relevant image areas. However, since these models cannot modify attention targets after training, they are difficult to apply to dynamically changing tasks. This paper proposes a novel visual attention model in a Key-Query-Value formulation. The model can switch attention targets when its Query representations are modified externally, a mechanism referred to as top-down attention. The proposed model was evaluated in a simulator and in a real-world environment. The model was compared to existing end-to-end robot…
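The Query-swapping idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' architecture: it uses generic scaled dot-product attention over a hypothetical 2x2 feature map, and every name and number in it (`attend`, `keys`, `values`, `query_a`, `query_b`) is an assumption made for illustration. It only shows that, with the Keys and Values held fixed, replacing the Query from outside moves the attended image position.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Generic scaled dot-product attention for a single query.

    keys   : one feature vector per image position (hypothetical values)
    values : one (row, col) coordinate per image position
    Returns the attention-weighted coordinate and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    point = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return point, weights


# Toy 2x2 feature map flattened to four positions (illustrative numbers only).
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
values = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # (row, col)

# Two externally supplied queries standing in for two attention targets.
query_a = [4.0, 0.0]
query_b = [0.0, 4.0]

point_a, weights_a = attend(query_a, keys, values)
point_b, weights_b = attend(query_b, keys, values)

print("query_a attends near", [round(c, 3) for c in point_a])
print("query_b attends near", [round(c, 3) for c in point_b])
```

In this toy setup `query_a` matches the Key at position (0, 0) and `query_b` matches the Key at position (0, 1), so swapping the Query alone shifts the attended point without retraining anything. How the paper produces, learns, or modifies its Query representations is not specified in the text above.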
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
