Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks?
Kaleb Kassaw, Francesco Luzi, Leslie M. Collins, Jordan M. Malof

TL;DR
This paper introduces the IRUO dataset to benchmark deep learning models' robustness to partial object occlusion, revealing that ViT models outperform CNNs and approach human accuracy, especially under certain occlusion types.
Contribution
The paper presents the IRUO dataset for evaluating occlusion robustness and compares modern CNN and ViT models against human performance on occluded images.
Findings
ViT models outperform CNNs on occluded images.
Deep models are less accurate than humans under diffuse occlusion.
Certain occlusion types significantly reduce model accuracy.
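The findings above compare accuracy across occlusion conditions. A minimal sketch of that kind of breakdown follows, assuming each evaluated image carries an occlusion-type tag. The function name `accuracy_by_group`, the tag "solid", and the toy records are hypothetical illustrations; they are not the IRUO data format or results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute classification accuracy separately for each group.

    records: iterable of (group, true_label, predicted_label) tuples,
    where `group` is a hypothetical occlusion-type tag.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        if y_true == y_pred:
            correct[group] += 1
    # Every group in `total` has at least one record, so no division by zero.
    return {group: correct[group] / total[group] for group in total}


# Toy records for illustration only; these are not numbers from the paper.
toy_records = [
    ("none", "car", "car"),
    ("none", "dog", "dog"),
    ("solid", "car", "car"),
    ("solid", "dog", "cat"),
    ("diffuse", "car", "bus"),
    ("diffuse", "dog", "cat"),
]
per_group = accuracy_by_group(toy_records)
```

Running the same function over model predictions and over human responses would give two per-condition tables that can be compared row by row.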
Abstract
Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part-based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer-generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image…
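To illustrate why artificial occlusion is inexpensive to label, here is a minimal sketch that covers a rectangular patch of an image and reports the occluded fraction directly from the patch geometry. The helper `occlude_patch`, its parameters, and the list-of-rows grayscale image are assumptions for illustration, not the procedure used in the paper.

```python
def occlude_patch(image, top, left, height, width, fill=0):
    """Return (occluded_copy, occluded_fraction) for a 2D grayscale image.

    image: list of rows, each a list of pixel values (hypothetical format).
    The rectangle is clipped to the image bounds; the input is not modified.
    """
    out = [row[:] for row in image]
    covered = 0
    for r in range(max(top, 0), min(top + height, len(out))):
        for c in range(max(left, 0), min(left + width, len(out[r]))):
            out[r][c] = fill
            covered += 1
    total = sum(len(row) for row in out)
    # The occlusion level is known exactly because the patch was generated.
    return out, (covered / total if total else 0.0)


# A 4x4 all-white toy image with a 2x2 black patch in the middle.
toy_image = [[255] * 4 for _ in range(4)]
occluded, fraction = occlude_patch(toy_image, top=1, left=1, height=2, width=2)
```

Because the occluder is synthetic, the occlusion label comes for free; real occlusion requires someone to annotate how much of the object is hidden.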
Taxonomy
Topics: Advanced Neural Network Applications
Methods: Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer
