TL;DR
This paper introduces COCOTree, a large-scale dataset and benchmark for hierarchical decomposition of images into visual components, built with a fully automated annotation pipeline and paired with a new evaluation metric.
Contribution
It presents a fully automated pipeline for creating a hierarchical visual decomposition dataset and establishes a standardized evaluation protocol for open tree-structured segmentation.
Findings
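COCOTree comprises over 21K images, 1.8M structural nodes, and more than 3.5K unique open-vocabulary labels.
Human evaluation indicates that the automatically generated annotations align strongly with human judgment.
The paper proposes the OTQ metric for comprehensive evaluation of open tree-structured segmentation.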
Abstract
We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms…
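To make the idea of a hierarchical tree of visual components concrete, here is a minimal sketch of how one such tree might be represented and summarized. It is not the COCOTree format: the `Node` class, its fields, the bounding-box stand-in for a segmentation mask, and the `walk` and `summarize` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One visual component in a decomposition tree (hypothetical schema)."""
    label: str                       # free-form, open-vocabulary label
    box: Tuple[int, int, int, int]   # stand-in for a mask: (x0, y0, x1, y1)
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in pre-order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def summarize(root: Node) -> dict:
    """Count nodes, distinct labels, and the maximum depth of one tree."""
    pairs = list(walk(root))
    return {
        "nodes": len(pairs),
        "unique_labels": len({n.label for _, n in pairs}),
        "max_depth": max(d for d, _ in pairs),
    }


# Toy example: a bicycle split into parts, and parts into sub-parts.
bicycle = Node("bicycle", (10, 20, 200, 160), [
    Node("wheel", (10, 90, 80, 160), [
        Node("tire", (10, 90, 80, 160)),
        Node("spoke", (40, 95, 45, 155)),
    ]),
    Node("wheel", (130, 90, 200, 160), [
        Node("tire", (130, 90, 200, 160)),
    ]),
    Node("frame", (60, 40, 150, 120)),
])

stats = summarize(bicycle)
print(stats)
```

Because the tree places no limit on depth or on the label set, the same structure can hold both coarse objects and fine sub-parts. Dataset-level figures such as total node count and number of unique labels would then be aggregates of per-image summaries like this one.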
