TL;DR
3DAlign-DAER introduces a novel framework with dynamic attention and efficient retrieval for improved fine-grained 3D-text alignment, scalable to large datasets.
Contribution
It proposes a unified framework that combines a dynamic attention policy with an efficient retrieval strategy to improve fine-grained 3D-text alignment at scale.
Findings
Outperforms traditional retrieval methods such as KNN in both accuracy and efficiency.
Constructs Align3D-2M, a large-scale dataset with 2 million text-3D pairs.
Demonstrates superior results on multiple benchmarks.
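For context on the KNN baseline mentioned in the findings, the following is a minimal sketch of brute-force nearest-neighbour retrieval over embeddings. It is not the paper's retrieval strategy; the function names `cosine` and `knn_retrieve` and the toy vectors are assumptions made purely for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_retrieve(query, database, k=2):
    """Score the query against every database embedding and keep the top k.

    Hypothetical helper: returns (index, similarity) pairs, best first.
    Every query costs O(N * d) work, which is what becomes expensive as
    the database grows.
    """
    scored = ((cosine(query, emb), idx) for idx, emb in enumerate(database))
    top = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in top]


# Toy "3D shape" embeddings and one "text" query embedding (made-up values).
shape_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
text_query = [1.0, 0.0, 0.0]

results = knn_retrieve(text_query, shape_embeddings, k=2)
```

Because every query is compared against every stored embedding, the cost of this baseline grows linearly with the database size.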
Abstract
Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits the Monte…
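To make the idea of token-to-point attention concrete, here is a generic scaled dot-product cross-attention sketch in which each text token distributes its attention over the points of a 3D shape. It is not the paper's HAF module or DAP; the function names, the scaling choice, and the toy features are all assumptions for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def token_to_point_attention(tokens, points):
    """Return one attention distribution over points per text token.

    Hypothetical helper: `tokens` and `points` are lists of feature
    vectors sharing the same dimension d. Logits are dot products
    scaled by 1 / sqrt(d), as in standard cross-attention.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attention = []
    for token in tokens:
        logits = [scale * sum(t * p for t, p in zip(token, point))
                  for point in points]
        attention.append(softmax(logits))
    return attention


# Made-up features: two text tokens, two point features.
token_features = [[1.0, 0.0], [0.0, 1.0]]
point_features = [[5.0, 0.0], [0.0, 5.0]]

weights = token_to_point_attention(token_features, point_features)
```

Each row of `weights` sums to 1, so it can be read as how strongly a given token attends to each point. In a learned model the features would come from text and point-cloud encoders rather than being fixed by hand.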
