TL;DR
This paper introduces PaIR-Net, a novel framework for jointly predicting action semantics and contact regions in images, supported by a new dataset, PaIR, to advance understanding of human-environment interactions.
Contribution
The paper proposes a unified model and dataset for simultaneous action recognition and contact localization, addressing the failure of existing methods to capture both action semantics and their spatial context.
Findings
PaIR-Net outperforms baseline models in contact and action prediction tasks.
The PaIR dataset contains 13,979 images with diverse actions, objects, and body parts.
Ablation studies validate the effectiveness of each architectural component.
Abstract
People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global…
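To make the joint "what" and "where" output concrete, here is a minimal Python sketch of the data flow the abstract describes. It is not the authors' implementation. The three functions are toy stand-ins for the roles attributed to CPAM, PGCS, and IIM, and every name, threshold, score dictionary, and action label in it is a hypothetical chosen for illustration. Because the abstract is truncated before it finishes describing IIM, that stand-in is reduced to a plain arg-max over action scores.

```python
from dataclasses import dataclass
from typing import Dict, List

Mask = List[List[int]]  # binary H x W grid, 1 marks a contact pixel


@dataclass
class JointPrediction:
    """Hypothetical container for the task's two outputs."""
    action: str                     # "what": high-level action label
    contact_masks: Dict[str, Mask]  # "where": one mask per contacting body part


def select_contact_parts(part_scores: Dict[str, float],
                         threshold: float = 0.5) -> List[str]:
    # Toy stand-in for the CPAM role: keep body parts whose
    # (assumed) contact-prior score reaches the threshold.
    return sorted(p for p, s in part_scores.items() if s >= threshold)


def segment_contacts(pixel_scores: Dict[str, List[List[float]]],
                     parts: List[str],
                     threshold: float = 0.5) -> Dict[str, Mask]:
    # Toy stand-in for the PGCS role: threshold per-pixel scores,
    # but only for the parts selected by the prior step.
    return {
        p: [[1 if v >= threshold else 0 for v in row] for row in pixel_scores[p]]
        for p in parts
    }


def infer_action(action_scores: Dict[str, float]) -> str:
    # Toy stand-in for the IIM role: pick the highest-scoring action.
    return max(action_scores, key=action_scores.get)


def predict(part_scores: Dict[str, float],
            pixel_scores: Dict[str, List[List[float]]],
            action_scores: Dict[str, float]) -> JointPrediction:
    parts = select_contact_parts(part_scores)
    masks = segment_contacts(pixel_scores, parts)
    return JointPrediction(action=infer_action(action_scores), contact_masks=masks)


# Made-up scores for a 2 x 2 image; none of these values come from the paper.
result = predict(
    part_scores={"hand": 0.9, "foot": 0.2},
    pixel_scores={
        "hand": [[0.8, 0.1], [0.6, 0.4]],
        "foot": [[0.0, 0.0], [0.0, 0.0]],
    },
    action_scores={"ride": 0.3, "hold": 0.7},
)
```

With these inputs, `result.action` is `"hold"` and `result.contact_masks` contains a single `"hand"` mask, since the foot score falls below the assumed threshold. The point of the sketch is only the shape of the output: one action label paired with per-body-part pixel masks.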