# Occlusion Robustness of CLIP for Military Vehicle Classification

**Authors:** Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf

arXiv: 2508.20760 · 2025-09-03

## TL;DR

This study evaluates how robust CLIP vision-language models are to occlusion in military vehicle classification. Transformer-based CLIP variants outperform CNN-based ones, and finetuning the backbone delays the sharp performance drop from around 35% occlusion (linear probing) to more than 60%.

## Contribution

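The paper analyzes CLIP's robustness to occlusion in military vehicle classification, a setting the authors describe as underexplored. It uses a custom dataset of 18 vehicle classes and the Normalized Area Under the Curve (NAUC) metric to show how occlusion type and finetuning strategy affect performance.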

## Key findings

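- Transformer-based CLIP models consistently outperform CNN-based ones under occlusion.
- Fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions.
- Linear-probed models improve accuracy, but their performance drops sharply at around 35% occlusion.
- Finetuning the backbone delays this drop to more than 60% occlusion.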
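
To make the difference between dispersed and contiguous occlusion concrete, the sketch below builds a boolean occlusion mask from randomly chosen square cells. It is an illustration under assumptions, not the paper's occlusion procedure, which is not described in this summary: the function names `occlusion_mask` and `occluded_fraction`, the grid-based sampling, and the image and patch sizes are all hypothetical.

```python
import random


def occlusion_mask(height, width, fraction, patch, seed=0):
    """Return a height x width grid of booleans; True marks an occluded pixel.

    Assumption: the image is split into patch x patch cells and randomly
    chosen cells are occluded until about `fraction` of the cells are covered.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if patch <= 0:
        raise ValueError("patch must be positive")
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng = random.Random(seed)
    chosen = rng.sample(cells, round(fraction * len(cells)))
    mask = [[False] * width for _ in range(height)]
    for r, c in chosen:
        for y in range(r * patch, min((r + 1) * patch, height)):
            for x in range(c * patch, min((c + 1) * patch, width)):
                mask[y][x] = True
    return mask


def occluded_fraction(mask):
    """Fraction of pixels marked as occluded."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Same target coverage, different granularity (hypothetical sizes).
fine = occlusion_mask(224, 224, 0.35, patch=16)    # many small, dispersed cells
coarse = occlusion_mask(224, 224, 0.35, patch=56)  # a few large blocks
```

Because coverage is quantized by the number of cells, the coarse mask can only approximate the requested fraction; a comparison between granularities should report the realized coverage from `occluded_fraction` rather than the target.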

## Abstract

Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
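
The abstract summarizes each accuracy-versus-occlusion curve with NAUC. The sketch below shows one plausible way to compute such a score: trapezoidal integration of accuracy over the occlusion levels, divided by the width of the occlusion range so that a model with perfect accuracy at every level scores 1.0. The exact normalization used in the paper is not stated in this summary, so this definition, the function name `nauc`, and the accuracy values are assumptions for illustration only, not results from the paper.

```python
def nauc(occlusion_levels, accuracies):
    """Normalized area under an accuracy-vs-occlusion curve.

    Assumption: trapezoidal area divided by the occlusion range, so a
    constant accuracy of 1.0 yields a score of 1.0.
    """
    if len(occlusion_levels) != len(accuracies):
        raise ValueError("inputs must have the same length")
    if len(occlusion_levels) < 2:
        raise ValueError("need at least two points")
    points = sorted(zip(occlusion_levels, accuracies))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("occlusion levels must span a positive range")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)  # trapezoid between neighbours
    return area / span


# Made-up accuracies at five occlusion levels (not taken from the paper).
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
example_accuracies = [0.8, 0.8, 0.4, 0.2, 0.0]
score = nauc(levels, example_accuracies)
```

A single score like this makes models comparable across the whole occlusion range, but it hides where the curve breaks, which is why the abstract also reports the occlusion level at which performance drops sharply.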

---
Source: https://tomesphere.com/paper/2508.20760