On the Influence of Shape, Texture and Color for Learning Semantic Segmentation
Annika Mütze, Natalie Grabowsky, Edgar Heinert, Matthias Rottmann, Hanno Gottschalk

TL;DR
This paper investigates how shape, texture, and color cues influence the training of deep neural networks for semantic segmentation, revealing that combined cues improve small object and border pixel predictions.
Contribution
It introduces a method to analyze individual and combined cue influences during training by creating cue-specific datasets and experts for semantic segmentation.
Findings
Shape + color cues improve small object detection.
No single cue dominates in learning success.
Transformers show a stronger shape bias than CNNs.
Abstract
Recent research has investigated the shape and texture biases of pre-trained deep neural networks (DNNs) in image classification. Those works test how much a trained DNN relies on specific image cues like texture. The present study shifts the focus to understanding the cue influence during training, analyzing what DNNs can learn from shape, texture, and color cues in the absence of the others, and investigating their individual and combined influence on the learning success. We analyze these cue influences at multiple levels by decomposing datasets into cue-specific versions. Addressing semantic segmentation, we learn the given task from these reduced cue datasets, creating cue experts. Early fusion of cues is performed by constructing appropriate datasets. This is complemented by a late fusion of experts, which allows us to study cue influence in a location-dependent manner at the pixel level. Experiments on…
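The late fusion of cue experts mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the fusion rule, so the rule used here (per pixel, take the prediction of the expert with the highest softmax confidence) is an assumption chosen for illustration, and the function name `late_fusion` and the array layout are hypothetical. The returned `winner` map records which expert was selected at each pixel, which is one way to inspect cue influence by location.

```python
import numpy as np

def late_fusion(expert_probs):
    """Fuse per-expert softmax maps with a max-confidence rule (an assumed rule,
    not necessarily the one used in the paper).

    expert_probs: array of shape (K, C, H, W) holding, for each of K cue
    experts, the class probabilities over C classes at every pixel.
    Returns (fused_labels, winner), both of shape (H, W).
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    confidence = expert_probs.max(axis=1)     # (K, H, W): top probability per expert
    labels = expert_probs.argmax(axis=1)      # (K, H, W): predicted class per expert
    winner = confidence.argmax(axis=0)        # (H, W): index of most confident expert
    # Pick, at each pixel, the label predicted by the winning expert.
    fused_labels = np.take_along_axis(labels, winner[None], axis=0)[0]
    return fused_labels, winner

# Toy example: two experts, two classes, a 1x2 image.
shape_expert = [[[0.9, 0.6]], [[0.1, 0.4]]]    # class-0 map, class-1 map
texture_expert = [[[0.3, 0.2]], [[0.7, 0.8]]]
fused, winner = late_fusion([shape_expert, texture_expert])
```

In the toy example, the first expert is more confident at the left pixel and the second at the right pixel, so the fused map mixes both predictions and `winner` shows which expert contributed where.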
Peer Reviews
Decision · Submitted to ICLR 2025
The paper is easy to read, with a significant level of detail dedicated to the experiments and results. The study is comprehensive in its scope, providing extensive combinations of cue-based datasets and evaluating them across different model architectures. By isolating and recombining shape, color, and texture cues, the paper sheds light on how each of these elements contributes to segmentation performance. This could be beneficial for understanding data domain gaps and predicting failure mod
The study’s approach, while straightforward, lacks sufficient depth in its experimental design and interpretation. The experiments primarily provide surface-level insights without delving into a more profound analysis of underlying factors. Most of the findings and corresponding discussion (for example, that shape cues correspond heavily to semantic boundaries and that textures play a significant role in broader regions) are valid but not particularly surprising and interesting to computer visio
1. This paper comprehensively analyzes the impact of shape, texture, gray and their combinations on semantic segmentation tasks, and provides a method to derive a texture-only dataset. This paper compares the different effects of these image cues on CNNs and Transformers. 2. The structure of the paper is relatively clear and the introduction of the methods is relatively detailed. 3. Studying the impact of different image cues on semantic segmentation is very meaningful for designing networks and co
1. This paper shows the effects of different visual cues on semantic segmentation, but lacks a detailed analysis of why these effects occur and does not explore which modules and operations could be introduced into the model to reduce these effects. 2. In Table 3, we find that the rank change of the CARLA dataset is large relative to the Cityscapes dataset. Please explain in more detail why performance differs between the datasets. 3. Figure 6 is not clear enough.
Previous shape and texture bias studies rarely considered dense prediction tasks, so this looks like a valuable contribution. The proposed study is the first that investigates the effect of individual cues (and different combinations of cues) on the training process of semantic segmentation models. The paper consolidates a method for extraction of different image cues from natural images. This enables transformation of the original datasets into variants that contain a single image cue or
Presentation quality could be improved. I suggest placing the tables and figures right after they are referenced in the text. I found the Texture Cue Extraction paragraph confusing. The main manuscript should include more details and be more descriptive. I am not sure how the Voronoi diagrams are created and whether they depend on the content of the corresponding image. Are class frequencies and distributions preserved in this texture cue dataset? The last paragraph in 4.1 should also describe the
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
Methods: Entropy Regularization · Sparse Evolutionary Training · Proximal Policy Optimization · CARLA: An Open Urban Driving Simulator
