DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding with a Homogeneous Framework

Yani Zhang; Dongming Wu; Hao Shi; Yingfei Liu; Tiancai Wang; Xingping Dong

arXiv:2506.05199·cs.CV·April 29, 2026

TL;DR

DEGround introduces a unified, object-centric transformer framework for ego-centric 3D visual grounding, enhancing performance and efficiency over traditional heterogeneous pipelines.

Contribution

The paper proposes a homogeneous framework in which detection and grounding share the same object queries, and adds two task-specific plug-in modules that improve fine-grained instruction grounding.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.
02. Outperforms previous methods by 7.52% on EmbodiedScan.
03. Demonstrates the effectiveness of shared object representations.

Abstract

A core task in embodied intelligence is ego-centric 3D visual grounding. Existing methods typically adopt two-stage, heterogeneous pipelines that pair a detector with a separate grounding model. Incompatible decoders and box heads hinder the transfer of object-level priors, and the split training causes redundant re-optimization. To overcome these limitations, we present DEGround, a straightforward, elegant, and effective framework that centers on object-level sharing across detection and grounding. It employs a single set of queries as the common object representation for both detection and grounding; these queries are decoded by a shared transformer and bounding box head. Building on this homogeneous framework, we further introduce two task-specific plug-in modules to enhance fine-grained instruction grounding. The Regional Activation Grounding module improves spatial-textual alignment by…
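
The shared-query idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SharedQueryModel`, the random linear maps standing in for the transformer decoder and box head, the 6-number box, and the dot-product scoring are all assumptions made purely for illustration. The sketch shows only that one set of queries, decoded once through shared weights, can feed both a detection readout and a grounding readout.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def rand_matrix(rng, rows, cols):
    """Random weights in [-1, 1]; a stand-in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class SharedQueryModel:
    """Hypothetical toy model: one decoder and one box head for both tasks."""

    def __init__(self, dim, num_classes, box_dim=6, seed=0):
        rng = random.Random(seed)
        # Assumption: a single linear map stands in for the shared decoder.
        self.decoder = rand_matrix(rng, dim, dim)
        # Assumption: box_dim=6 is an illustrative box parameterization.
        self.box_head = rand_matrix(rng, box_dim, dim)
        self.class_embed = rand_matrix(rng, num_classes, dim)

    def decode(self, queries):
        """Shared path: every query yields a feature vector and a box."""
        feats = [matvec(self.decoder, q) for q in queries]
        boxes = [matvec(self.box_head, f) for f in feats]
        return feats, boxes

    def detect(self, queries):
        """Detection readout: best-matching class index per query, plus boxes."""
        feats, boxes = self.decode(queries)
        labels = [
            max(range(len(self.class_embed)),
                key=lambda c: dot(self.class_embed[c], f))
            for f in feats
        ]
        return labels, boxes

    def ground(self, queries, text_embedding):
        """Grounding readout: the query whose feature best matches the text."""
        feats, boxes = self.decode(queries)
        scores = [dot(f, text_embedding) for f in feats]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, boxes[best]
```

Because both readouts go through the same `decode` call, the box returned by `ground` is always one of the boxes that `detect` would produce for the same queries. That is the property the homogeneous design relies on: object-level priors learned for detection are directly available to grounding.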
