Occlusion Robustness of CLIP for Military Vehicle Classification
Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf

TL;DR
This study evaluates the robustness of CLIP vision-language models in military vehicle classification under occlusion, revealing that transformer-based models outperform CNNs and that finetuning improves occlusion resilience.
Contribution
The paper provides the first comprehensive analysis of CLIP's robustness to occlusion in military environments, highlighting the effects of occlusion type and model finetuning.
Findings
Transformer-based CLIP models outperform CNNs in occlusion scenarios.
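Dispersed, fine-grained occlusions degrade performance more than larger contiguous occlusions.
Finetuning the backbone extends robustness: the sharp performance drop seen at around 35% occlusion for linear-probed models is delayed to over 60% occlusion.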
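To make the distinction between dispersed and contiguous occlusion concrete, the following is a minimal sketch of how two masks with the same occluded fraction but different granularity could be generated. It is not the paper's occlusion procedure, which this summary does not describe. The function names, the 224x224 image size, the patch sizes, and the grid-based sampling are all assumptions made for illustration.

```python
import random


def patch_occlusion_mask(height, width, fraction, patch, seed=0):
    """Return a height x width grid of booleans; True marks an occluded pixel.

    Hypothetical helper: the image is split into patch x patch cells and a
    random subset of cells is occluded. A small patch size gives dispersed,
    fine-grained occlusion; a large one gives a few big contiguous occluders.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if patch <= 0 or height % patch or width % patch:
        raise ValueError("patch must be positive and divide both dimensions")

    rows, cols = height // patch, width // patch
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    n_occluded = round(fraction * len(cells))
    chosen = random.Random(seed).sample(cells, n_occluded)

    mask = [[False] * width for _ in range(height)]
    for r, c in chosen:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                mask[y][x] = True
    return mask


def occluded_fraction(mask):
    """Fraction of pixels marked as occluded."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Same occluded area, different granularity (sizes are illustrative only).
fine = patch_occlusion_mask(224, 224, 0.25, patch=8)      # many small occluders
coarse = patch_occlusion_mask(224, 224, 0.25, patch=112)  # one large occluder
```

Holding the occluded fraction fixed while varying only the patch size is one way to isolate the effect of occlusion granularity from the effect of occlusion amount.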
Abstract
Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this…
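The abstract names Normalized Area Under the Curve (NAUC) across occlusion percentages as the evaluation metric but does not give its formula here. The sketch below shows one plausible reading, assumed rather than taken from the paper: the trapezoidal area under the accuracy-versus-occlusion curve, divided by the width of the occlusion range, so that a model with perfect accuracy at every level scores 1. The function name and the example numbers are hypothetical.

```python
def nauc(occlusion_levels, accuracies):
    """Normalized area under an accuracy-vs-occlusion curve.

    Assumed definition: trapezoidal area divided by the occlusion range.
    The paper's exact normalization may differ.
    """
    if len(occlusion_levels) != len(accuracies) or len(occlusion_levels) < 2:
        raise ValueError("need at least two matching (level, accuracy) points")

    # Sort by occlusion level so the trapezoids are well defined.
    points = sorted(zip(occlusion_levels, accuracies))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("occlusion levels must cover a non-zero range")

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area / span


# Made-up accuracy curves for illustration; these are NOT results from the paper.
levels = [0.0, 0.2, 0.4, 0.6, 0.8]
early_drop = [0.9, 0.8, 0.4, 0.2, 0.1]  # degrades early
late_drop = [0.9, 0.9, 0.8, 0.6, 0.2]   # degrades late
print(nauc(levels, early_drop), nauc(levels, late_drop))
```

Under this definition, a curve that stays high for longer before dropping yields a larger NAUC, which is why a single number can summarize robustness across all occlusion levels.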
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
