PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement

Shutong Jin; Ruiyu Wang; Kuangyi Chen; Florian T. Pokorny

arXiv:2410.22059 · cs.RO · December 3, 2024

TL;DR

PACA is a zero-shot scene rearrangement method that uses perspective-aware cross-attention from Stable Diffusion to generate object representations and control viewpoints, enabling more accurate robot task execution.

Contribution

It introduces a unified, perspective-aware cross-attention representation for scene rearrangement that integrates generation, segmentation, and encoding, extending to 6-DoF views in a zero-shot setting.

Findings

01. Achieves 87% average matching accuracy in real robot experiments.

02. Attains 67% execution success rate across various scenes.

03. Supports 6-DoF camera view matching, surpassing previous 3-DoF limitations.

Abstract

Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation…
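
The matching step described in the abstract (pairing each object in the real scene with its counterpart in the generated goal, then computing a pose change) can be illustrated with a minimal sketch. This is not PACA's implementation: the feature vectors below are made-up placeholders standing in for whatever object-level representation is used, the function names are hypothetical, the matching uses plain cosine similarity with a brute-force one-to-one assignment, and the "pose transformation" is reduced to a 2D translation between object centers.

```python
from itertools import permutations

import numpy as np


def cosine_similarity_matrix(real_feats, goal_feats):
    """Pairwise cosine similarity between rows of two feature matrices."""
    real = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    goal = goal_feats / np.linalg.norm(goal_feats, axis=1, keepdims=True)
    return real @ goal.T


def match_objects(real_feats, goal_feats):
    """For each real object, return the index of its matched goal object.

    Assumes both scenes contain the same number of objects. Brute force
    over all permutations, so only suitable for a handful of objects.
    """
    sim = cosine_similarity_matrix(real_feats, goal_feats)
    n = sim.shape[0]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(sim[i, perm[i]] for i in range(n)),
    )
    return list(best)


def planar_translations(real_centers, goal_centers, assignment):
    """2D displacement that moves each real object to its matched goal."""
    return goal_centers[assignment] - real_centers


# Placeholder features (one row per object); values are illustrative only.
real_feats = np.array([[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.0, 1.0]])
goal_feats = np.array([[0.0, 0.9, 0.1], [0.2, 0.0, 1.0], [1.0, 0.1, 0.0]])

# Placeholder object centers in normalized image coordinates.
real_centers = np.array([[0.1, 0.2], [0.5, 0.5], [0.8, 0.1]])
goal_centers = np.array([[0.4, 0.4], [0.6, 0.2], [0.2, 0.3]])

assignment = match_objects(real_feats, goal_feats)
moves = planar_translations(real_centers, goal_centers, assignment)
```

A one-to-one assignment is used here rather than a per-object nearest neighbor so that two real objects cannot claim the same goal object. A full 6-DoF setting would need rotations and depth as well, which this sketch does not attempt.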

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Diffusion