PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation
Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu

TL;DR
PC-CrossDiff introduces a dual-level differential attention framework that significantly improves 3D visual grounding accuracy in complex multi-object scenes by better capturing implicit cues and suppressing irrelevant spatial interference.
Contribution
It proposes a novel dual-level attention architecture with point-level and cluster-level modules for enhanced 3D referring and segmentation performance.
Findings
Achieves state-of-the-art results on ScanRefer, NR3D, and SR3D benchmarks.
Improves grounding accuracy by +10.16% on ScanRefer's implicit subset.
Effectively parses implicit spatial cues and suppresses irrelevant spatial relations.
Abstract
3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES.…
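The abstract names a cross-modal differential attention architecture but does not spell out its equations here. As a rough illustration only, the sketch below shows one common form of differential attention: two softmax attention maps are computed and one is subtracted from the other, so that weight shared by both maps cancels. Everything in it is an assumption made for illustration and not the authors' implementation: the function name `differential_cross_attention`, the projection matrices, the fixed weight `lam`, and the choice of point features as queries and text tokens as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_cross_attention(points, tokens, Wq1, Wq2, Wk1, Wk2, Wv, lam=0.5):
    """Hypothetical sketch: point features attend to text tokens.

    points: (n_points, d_in) visual features (queries)
    tokens: (n_tokens, d_in) language features (keys and values)
    lam:    assumed fixed weight of the subtracted attention map
    """
    d_head = Wq1.shape[1]
    scale = np.sqrt(d_head)
    # Two independent attention maps over the same tokens.
    a1 = softmax((points @ Wq1) @ (tokens @ Wk1).T / scale)
    a2 = softmax((points @ Wq2) @ (tokens @ Wk2).T / scale)
    # Differential map: attention common to both maps is cancelled.
    diff = a1 - lam * a2
    return diff @ (tokens @ Wv), diff

# Toy inputs; shapes and random weights are arbitrary.
rng = np.random.default_rng(0)
n_points, n_tokens, d_in, d_head = 6, 4, 8, 5
points = rng.normal(size=(n_points, d_in))
tokens = rng.normal(size=(n_tokens, d_in))
Wq1, Wq2, Wk1, Wk2 = (rng.normal(size=(d_in, d_head)) for _ in range(4))
Wv = rng.normal(size=(d_in, d_head))

out, diff = differential_cross_attention(points, tokens, Wq1, Wq2, Wk1, Wk2, Wv)
print(out.shape, diff.shape)
```

Each row of the differential map sums to `1 - lam` because both softmax maps sum to one per row, and setting `lam=0` recovers ordinary cross-attention. How the paper applies this idea at the point level and the cluster level is not covered by the truncated abstract.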
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
