An Investigation on The Position Encoding in Vision-Based Dynamics Prediction

Jiageng Zhu; Hanchen Xie; Jiazhi Li; Mahyar Khayatkhoei; Wael AbdAlmageed

arXiv:2408.15201·cs.CV·August 28, 2024

TL;DR

This paper investigates how position encoding via bounding boxes influences vision-based dynamics prediction, revealing the implicit encoding process and limitations when environment context varies.

Contribution

It provides a detailed analysis of how bounding boxes encode position information and explores the limitations of relying solely on object abstracts in dynamics prediction.

Findings

01. Bounding boxes can implicitly encode position information through ROI pooling.

02. Using only object abstracts limits prediction accuracy across varying environments.

03. Explicitly modeling environment context improves dynamics prediction robustness.
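
The first finding can be illustrated with a toy sketch. The code below is not the paper's implementation: `roi_avg_pool` is a hypothetical, simplified average-pooling variant of ROI pooling written in plain Python, and the feature map is an assumed stand-in whose activations happen to vary with location. It shows that two boxes of identical size yield different pooled features when placed at different locations, so a downstream network could in principle recover position from the pooled output even though the box coordinates are never fed to it directly.

```python
def roi_avg_pool(feature_map, box, out_size=2):
    """Average-pool a box region of a 2D feature map into an out_size x out_size grid.

    box = (x0, y0, x1, y1) in integer feature-map cells, x1 and y1 exclusive.
    Simplifying assumption: box width and height are multiples of out_size.
    """
    x0, y0, x1, y1 = box
    bin_h = (y1 - y0) // out_size
    bin_w = (x1 - x0) // out_size
    if bin_h < 1 or bin_w < 1:
        raise ValueError("box is smaller than the pooling grid")

    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Collect the cells that fall into bin (i, j) of the box.
            cells = [
                feature_map[y][x]
                for y in range(y0 + i * bin_h, y0 + (i + 1) * bin_h)
                for x in range(x0 + j * bin_w, x0 + (j + 1) * bin_w)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Assumed toy feature map: the activation at (y, x) is x + 10 * y,
# i.e. it depends on location rather than on object appearance.
feature_map = [[x + 10 * y for x in range(8)] for y in range(8)]

# Two boxes of the same size (4 x 4 cells) at different locations.
box_a = (0, 0, 4, 4)
box_b = (4, 2, 8, 6)

pooled_a = roi_avg_pool(feature_map, box_a)
pooled_b = roi_avg_pool(feature_map, box_b)

# Every pooled bin differs by the same amount, which reflects the box shift
# (4 cells in x, 2 cells in y): 4 + 10 * 2 = 24.
offsets = [b - a for ra, rb in zip(pooled_a, pooled_b) for a, b in zip(ra, rb)]
print(pooled_a)
print(pooled_b)
print(offsets)
```

In this sketch the pooled values change only because the box moved, not because its size or the "object" changed. Whether a real backbone produces location-dependent activations of this kind is exactly what the paper examines; the linear feature map here is purely an assumption made for illustration.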

Abstract

Despite the success of vision-based dynamics prediction models, which predict object states from RGB images and simple object descriptions, these models are challenged by environment misalignments. Although the literature has demonstrated that unifying visual domains with both environment context and object abstracts, such as semantic segmentation and bounding boxes, can effectively mitigate the visual domain misalignment challenge, discussion has focused on the abstract of the environment context, and the insight behind using bounding boxes as the object abstract remains under-explored. Furthermore, we notice that, as empirical results in the literature show, even when the visual appearance of objects is removed, object bounding boxes alone, instead of being directly fed into the network, can indirectly provide sufficient position information via the Region of Interest Pooling operation for…

Taxonomy

Topics: Advanced Vision and Imaging