# Contrastive Learning through Auxiliary Branch for Video Object Detection

**Authors:** Lucas Rakotoarivony

arXiv: 2508.20551 · 2025-08-29

## TL;DR

This paper introduces CLAB, a contrastive learning method with an auxiliary branch that improves video object detection robustness to image degradation without extra inference cost, achieving state-of-the-art results among CNN-based models on ImageNet VID.

## Contribution

The paper proposes a simple contrastive auxiliary branch and dynamic loss weighting strategy to enhance feature representation in video object detection.
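
To make the idea of a contrastive objective concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is an illustration only: this summary does not say which contrastive loss CLAB uses, how positive pairs are formed, or what temperature is applied, so the function name `info_nce`, the `temperature` default, and the choice of in-batch negatives are all assumptions.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (assumed formulation, not taken from the paper).

    anchors[i] is expected to match positives[i]; every other positive
    in the batch is treated as a negative for anchor i.
    """
    def normalize(v):
        # L2-normalize so the dot product below is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every positive, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target class.
        total += lse - logits[i]
    return total / len(a)
```

With this formulation the loss is close to zero when each anchor is most similar to its own positive and grows when anchors are closer to other samples in the batch.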

## Key findings

- Achieves 84.0% mAP on ImageNet VID with ResNet-101
- Achieves 85.2% mAP on ImageNet VID with ResNeXt-101
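- Demonstrates consistent performance gains across comprehensive experiments and ablation studies
- Adds no computational load at inference and requires no additional post-processing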

## Abstract

Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector's backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.
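
The dynamic loss weighting described in the abstract can be pictured as a schedule that shrinks the auxiliary term as training progresses. The sketch below uses a cosine decay, which is an assumption: the abstract does not give the actual schedule, its endpoints, or how the two losses are combined, and the names `aux_weight`, `total_loss`, `w_start`, and `w_end` are hypothetical.

```python
import math


def aux_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Cosine decay from w_start to w_end (hypothetical schedule).

    Early steps weight the auxiliary contrastive loss heavily;
    late steps leave mostly the detection loss.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay well-defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(det_loss, aux_loss, step, total_steps):
    """Detection loss plus the down-weighted auxiliary loss (assumed combination)."""
    return det_loss + aux_weight(step, total_steps) * aux_loss
```

Because the auxiliary branch only contributes to the training loss, a design like this is consistent with the abstract's claim that nothing extra runs at inference time.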

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20551/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20551/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/2508.20551/full.md

---
Source: https://tomesphere.com/paper/2508.20551