AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation
Yuanwen Yue, Sabarinath Mahadevan, Jonas Schult, Francis Engelmann, Bastian Leibe, Konrad Schindler, Theodora Kontogianni

TL;DR
AGILE3D is an attention-based model for interactive multi-object 3D segmentation that improves accuracy, efficiency, and user experience by enabling simultaneous segmentation and explicit click interactions.
Contribution
It introduces a novel attention-guided approach supporting multi-object segmentation with fewer clicks and faster inference in 3D point clouds.
Findings
Sets a new state of the art on four 3D datasets.
Requires fewer user clicks for accurate segmentation.
Proven effective in real-world user studies.
Abstract
During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We…
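The click-sharing idea in the abstract can be illustrated with a minimal sketch. This is not the AGILE3D implementation: the function names, the `BACKGROUND` label id, and the score format are hypothetical, and as a simplification every click counts as negative for all other objects rather than only nearby ones. The first function shows how one set of multi-object clicks yields positive and negative clicks for every object at once; the second shows objects competing for each point.

```python
BACKGROUND = 0  # hypothetical label id reserved for the background


def per_object_click_views(clicks):
    """Derive positive/negative clicks per object from shared multi-object clicks.

    clicks: list of (point_index, object_id) pairs; object_id 0 is background.
    A click on one object is treated as a negative click for every other object
    (a simplification of "nearby objects" in the abstract).
    """
    object_ids = sorted({obj for _, obj in clicks if obj != BACKGROUND})
    views = {obj: {"positive": [], "negative": []} for obj in object_ids}
    for point, obj in clicks:
        for target in object_ids:
            key = "positive" if obj == target else "negative"
            views[target][key].append(point)
    return views


def assign_labels(scores):
    """Give each point the label with the highest score.

    scores: one dict per point mapping label id -> score (assumed format).
    Taking the maximum over all labels makes adjacent objects compete directly.
    """
    return [max(point_scores, key=point_scores.get) for point_scores in scores]


if __name__ == "__main__":
    clicks = [(10, 1), (42, 2), (7, BACKGROUND)]
    print(per_object_click_views(clicks))
    print(assign_labels([{0: 0.1, 1: 0.7, 2: 0.2}, {0: 0.3, 1: 0.3, 2: 0.4}]))
```

In a one-object-at-a-time binary setup, the user would have to supply the negative clicks for each object separately; here three clicks already give each object one positive and two negative clicks.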
Peer Reviews
Decision: ICLR 2024 poster
1) The paper provides detailed information and resources about AGILE3D, allowing readers to gain a deeper understanding of the framework and its implementation. 2) The appendix includes additional materials such as network architecture, algorithms, and data samples, which further support the understanding and practical implementation of AGILE3D. The authors provide clear and detailed explanations of the concepts, algorithms, and techniques used in AGILE3D, ensuring that readers can follow and re
1) Lack of Novelty: For the proposed methods, the novel point is the multi-object segmentation, and the fast inference is also a novel point to some degree. And for efficiency, the submission does not present a technical point on how to improve efficiency. Multi-object segmentation is also achieved by the previous methods (InterObject3D) via some small improvements. Actually, the performance on the multi-object segmentation is not the best, according to Table 4. 2) Limited Evaluation: The append
- The proposed approach pioneers in the realm of interactive multi-object 3D segmentation by introducing an avant-garde attention-centric model. - The real-world practicality of the proposed approach is convincing with achieving top-tier results on both the popular individual and multi-object interactive segmentation benchmarks, w.r.t. SOTA methods such as Mask3D. I also appreciate the detailed computational cost comparisons. - AGILE3D's potential to discern multiple entities with diminished use
- Discussion of limitations/drawbacks should be put in the main body of the manuscript instead of the appendix for a fairer portrayal. - More failure cases need to be showcased and analyzed if any, otherwise the paper seems to focus too much on the advantages of the presented method and does not always give the whole picture.
1. The interactive approach, AGILE3D, can segment multiple objects simultaneously with a limited number of user clicks. Compared to previous single-object iterative models, the proposed approach can reduce annotation time. 2. Sufficient experiments have been conducted to show the promising results of the proposed method.
1. How does the model correct a wrongly segmented object instance? For example, in Figure 6, if the initial segmentation wrongly groups two chairs into one instance, how can a later click fix this mistake? 2. In Table 9, results 2 and 7 show minor performance drops without the C2C attention and temporal encoding components. Does this indicate that modeling the relations between clicks is less important for the proposed method? 3. During training and testing, the clicks are sampled at the center o
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Neural Network Applications · 3D Surveying and Cultural Heritage
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
