An Investigation on The Position Encoding in Vision-Based Dynamics Prediction
Jiageng Zhu, Hanchen Xie, Jiazhi Li, Mahyar Khayatkhoei, Wael AbdAlmageed

TL;DR
This paper investigates how position encoding via bounding boxes influences vision-based dynamics prediction, revealing the implicit encoding process and limitations when environment context varies.
Contribution
It provides a detailed analysis of how position information is encoded through bounding boxes and explores the limitations of relying solely on object abstracts in dynamics prediction.
Findings
Bounding boxes can implicitly encode position information through ROI pooling.
Using only object abstracts limits prediction accuracy across varying environments.
Explicitly modeling environment context improves dynamics prediction robustness.
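The first finding can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: `roi_max_pool`, the 8x8 `feature_map`, and both boxes are hypothetical names and values chosen for illustration. It assumes the feature map varies with spatial location, so that pooling over a box returns values that depend on where the box sits, even though the box coordinates are never passed to the network as input.

```python
def roi_max_pool(feature_map, box, out_size=2):
    """Max-pool the region box = (x0, y0, x1, y1) into out_size x out_size bins.

    Coordinates are in feature-map cells; x1 and y1 are exclusive.
    This is a simplified, single-channel stand-in for ROI pooling.
    """
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0

    def bin_edges(lo, size, i):
        # Floor for the start and ceiling for the end, so no bin is empty.
        start = lo + (i * size) // out_size
        end = lo + ((i + 1) * size + out_size - 1) // out_size
        return start, end

    pooled = []
    for by in range(out_size):
        ys, ye = bin_edges(y0, height, by)
        row = []
        for bx in range(out_size):
            xs, xe = bin_edges(x0, width, bx)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Hypothetical feature map whose values vary with location (assumption):
# cell (x, y) holds 10 * y + x.
feature_map = [[10 * y + x for x in range(8)] for y in range(8)]

# Two boxes of identical size at different positions.
box_a = (0, 0, 4, 4)
box_b = (4, 4, 8, 8)

pooled_a = roi_max_pool(feature_map, box_a)
pooled_b = roi_max_pool(feature_map, box_b)

# A spatially constant map, for contrast: pooling then carries no position cue.
flat_map = [[1 for _ in range(8)] for _ in range(8)]
flat_a = roi_max_pool(flat_map, box_a)
flat_b = roi_max_pool(flat_map, box_b)

print(pooled_a)  # [[11, 13], [31, 33]]
print(pooled_b)  # [[55, 57], [75, 77]]
print(flat_a == flat_b)  # True
```

In this sketch the two boxes have the same shape, yet their pooled features differ because the underlying features differ by location. With the constant map the pooled outputs are identical, which suggests that any position signal recovered this way depends on the features the boxes are pooled from.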
Abstract
Despite the success of vision-based dynamics prediction models, which predict object states from RGB images and simple object descriptions, they are challenged by environment misalignments. Although the literature has demonstrated that unifying visual domains with both an environment context and an object abstract, such as semantic segmentation and bounding boxes, can effectively mitigate the visual domain misalignment challenge, discussions have focused on the abstract of the environment context, and the insight of using bounding boxes as the object abstract remains under-explored. Furthermore, we notice that, as empirical results in the literature show, even when the visual appearance of objects is removed, object bounding boxes alone, rather than being directly fed into the network, can indirectly provide sufficient position information via the Region of Interest Pooling operation for…
Taxonomy
Topics: Advanced Vision and Imaging
