COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar, Jasper Uijlings, Vittorio Ferrari

TL;DR
COCO-Stuff enhances the COCO dataset with detailed pixel-wise annotations for 91 stuff classes, enabling better understanding of scene context, object relationships, and segmentation performance analysis.
Contribution
The paper introduces COCO-Stuff, a large-scale dataset with pixel-wise annotations for stuff classes, and presents an annotation protocol and analysis of scene context and segmentation.
Findings
Stuff classes cover significant image surface area.
Rich spatial relations between stuff and things are identified.
Segmentation performance varies between stuff and thing classes.
Abstract
Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While lots of classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow to explain important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (through contextual reasoning); (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we introduce COCO-Stuff, which augments all 164K images of the COCO 2017 dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels, which leverages the original thing annotations. We quantify the speed versus…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
