What-Meets-Where: Unified Learning of Action and Contact Localization in Images

Yuxiao Wang; Yu Lei; Wolin Liang; Weiying Xue; Zhenao Wei; Nan Zhuang; Qi Liu

arXiv:2508.09428 · cs.CV · March 31, 2026

TL;DR

This paper introduces PaIR-Net, a novel framework for jointly predicting action semantics and contact regions in images, supported by a new dataset, PaIR, to advance understanding of human-environment interactions.

Contribution

The paper proposes a unified model and dataset for simultaneous action recognition and contact localization, addressing the failure of existing methods to model action semantics and their spatial context jointly.

Findings

01. PaIR-Net outperforms baseline models in contact and action prediction tasks.

02. The PaIR dataset contains 13,979 images with diverse actions, objects, and body parts.

03. Ablation studies validate the effectiveness of each architectural component.

Abstract

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global…
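
To make the joint "what" and "where" output concrete, the following is a minimal sketch of what one prediction for this task might look like as a data structure. It is not taken from the paper: the class name `JointPrediction`, its fields, the label `sit_on`, and the body-part names are all hypothetical, and the tiny 2x2 masks stand in for full-resolution pixel-wise contact maps.

```python
from dataclasses import dataclass, field


# Hypothetical container for one joint prediction. Names and labels are
# illustrative assumptions, not the paper's actual output format.
@dataclass
class JointPrediction:
    # The "what": a high-level action label.
    action: str
    # The "where": body part name -> binary contact mask (rows of 0/1).
    contact_masks: dict = field(default_factory=dict)

    def contacting_parts(self):
        """Return, sorted, the body parts whose mask has any contact pixel."""
        return sorted(
            part
            for part, mask in self.contact_masks.items()
            if any(any(row) for row in mask)
        )


# Toy example: the hip mask has contact pixels, the hand mask is empty.
pred = JointPrediction(
    action="sit_on",
    contact_masks={
        "hip": [[0, 1], [1, 1]],
        "hand": [[0, 0], [0, 0]],
    },
)
print(pred.action, pred.contacting_parts())
```

Keeping the action label and the per-part masks in one record mirrors the task definition in the abstract, where both are predicted together rather than by separate systems.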
