SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen

TL;DR
SceneParser introduces a hierarchical scene parsing framework that captures structured dependencies in scenes, enabling more comprehensive understanding of objects, parts, and affordances for visual perception tasks.
Contribution
The paper presents SceneParser, a VLM-based hierarchical scene parser trained on a large-scale benchmark (SceneParser-Bench), which improves structured scene understanding over existing methods.
Findings
SceneParser outperforms existing models on hierarchical parsing tasks.
The benchmark contains over 110K training images with detailed annotations.
SceneParser provides structure-aware representations compatible with downstream tasks.
Abstract
General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine,…
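To make the scene -> object -> part -> affordance hierarchy concrete, the following is a minimal sketch of how such a parse could be held in memory. It is not the paper's output format: every class name, field name, the box convention, and the containment check are assumptions made here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical conventions (not from the paper):
# a box is (x1, y1, x2, y2) in pixels, a point is (x, y).
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


@dataclass
class Affordance:
    action: str   # e.g. "grasp"
    point: Point  # assumed interaction point


@dataclass
class Part:
    name: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class SceneObject:
    name: str
    box: Box
    parts: List[Part] = field(default_factory=list)


@dataclass
class Scene:
    description: str
    objects: List[SceneObject] = field(default_factory=list)


def bindings(scene: Scene) -> List[Tuple[str, str, str]]:
    """Flatten the tree into (object, part, action) triples, one per affordance."""
    return [
        (obj.name, part.name, aff.action)
        for obj in scene.objects
        for part in obj.parts
        for aff in part.affordances
    ]


def box_in_box(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def point_in_box(point: Point, box: Box) -> bool:
    """True if `point` lies within `box` (edges included)."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def is_consistent(scene: Scene) -> bool:
    """Illustrative cross-level check (an assumption of this sketch):
    each part sits inside its object, each affordance point inside its part."""
    for obj in scene.objects:
        for part in obj.parts:
            if not box_in_box(part.box, obj.box):
                return False
            if not all(point_in_box(a.point, part.box) for a in part.affordances):
                return False
    return True


# Made-up example values, purely illustrative.
example = Scene(
    description="a mug on a table",
    objects=[
        SceneObject(
            name="mug",
            box=(10, 10, 110, 110),
            parts=[Part("handle", (80, 30, 110, 90), [Affordance("grasp", (95, 60))])],
        )
    ],
)
```

Nesting each level inside its parent keeps the cross-level binding explicit: an affordance cannot exist without the part and object it belongs to, which is the contrast with isolated object, part, or point predictions that the abstract draws.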
