3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation

Jusheng Zhang; Yijia Fan; Zimo Wen; Jian Wang; Keze Wang

arXiv:2601.04404 · cs.CV · January 9, 2026

Open Access

TL;DR

This paper introduces Tri-MARF, a tri-modal multi-agent framework for large-scale 3D object annotation that integrates 2D multi-view images, textual descriptions, and 3D point clouds, and reports outperforming existing methods.

Contribution

Tri-MARF combines three specialized agents (a vision-language model agent, an information aggregation agent, and a gating agent) to improve 3D annotation accuracy and efficiency, addressing challenges such as occlusion and viewpoint variation.

Findings

01

Achieved a CLIPScore of 88.7, surpassing previous methods.

02

Attained retrieval accuracy of 45.2% and 43.8%.

03

Processed up to 12,000 objects per hour on a single GPU.

Abstract

Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially…
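
The three-agent flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Candidate`, `describe_views`, `rank_candidates`, `gate_caption`, and `annotate` are hypothetical, the captioner and the text-to-geometry alignment function are toy stand-ins for a vision-language model and a 3D alignment model, and the score-based ranking and fixed threshold are assumptions chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    """One per-view description with a confidence score (hypothetical type)."""
    view_id: int
    text: str
    score: float


def describe_views(
    views: Sequence[str],
    captioner: Callable[[str], Tuple[str, float]],
) -> List[Candidate]:
    """Agent 1 stand-in: produce one scored description per rendered view."""
    candidates = []
    for view_id, view in enumerate(views):
        text, score = captioner(view)
        candidates.append(Candidate(view_id, text, score))
    return candidates


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Agent 2 stand-in: order descriptions from most to least confident."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def gate_caption(
    ranked: List[Candidate],
    alignment: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Agent 3 stand-in: keep the best description that agrees with the 3D geometry."""
    if not ranked:
        raise ValueError("no candidate descriptions to gate")
    for candidate in ranked:
        if alignment(candidate.text) >= threshold:
            return candidate.text
    # Assumed fallback: no description passes the gate, so return the top-ranked one.
    return ranked[0].text


def annotate(views, captioner, alignment, threshold: float = 0.5) -> str:
    """Run the three stages in sequence for a single object."""
    ranked = rank_candidates(describe_views(views, captioner))
    return gate_caption(ranked, alignment, threshold)


# Toy data: the side view yields a confident but wrong description.
toy_captions = {
    "front": ("a wooden chair", 0.8),
    "side": ("a table", 0.9),
    "top": ("a brown square", 0.4),
}


def toy_captioner(view: str) -> Tuple[str, float]:
    return toy_captions[view]


def toy_alignment(text: str) -> float:
    # Pretend the point cloud is chair-shaped.
    return 1.0 if "chair" in text else 0.1


caption = annotate(["front", "side", "top"], toy_captioner, toy_alignment)
print(caption)
```

In this toy run the highest-scoring description ("a table") is rejected by the gate because it does not match the geometry, and the next candidate is returned instead. This illustrates why a geometry-aware gating stage can help with viewpoint inconsistency.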

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques