# Learning to Detect and Retrieve Objects from Unlabeled Videos

**Authors:** Elad Amrani, Rami Ben-Ari, Tal Hakim, Alex Bronstein

arXiv: 1905.11137 · 2019-10-22

## TL;DR

This paper proposes a weakly supervised method that learns object detection and retrieval from narrated videos without any manual labels. It uses contrastive samples for background rejection and a new clustering score to cope with the high level of label noise.

## Contribution

It learns object detectors from unlabeled videos by exploiting the natural correlation between narration and the visual presence of objects, framing the task as weakly supervised learning with noisy labels.

## Key findings

- Evaluated on 11 manually annotated objects in over 5000 frames
- Compared against a weakly supervised approach as the baseline
- A strongly labeled (manually annotated) model is reported as an upper bound

## Abstract

Learning an object detector or retrieval requires a large data set with manual annotations. Such data sets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose to exploit the natural correlation in narrations and the visual presence of objects in video, to learn an object detector and retrieval without any manual labeling involved. We pose the problem as weakly supervised learning with noisy labels, and propose a novel object detection paradigm under these constraints. We handle the background rejection by using contrastive samples and confront the high level of label noise with a new clustering score. Our evaluation is based on a set of 11 manually annotated objects in over 5000 frames. We show comparison to a weakly-supervised approach as baseline and provide a strongly labeled upper bound.
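To make the idea of narration as a weak, noisy label source concrete, here is a minimal sketch. It is not the paper's pipeline: the `Caption` type, the `weak_frame_labels` function, the `margin` parameter, and the substring matching rule are all hypothetical, chosen only for illustration. A frame is marked positive when a caption mentioning the object overlaps its timestamp.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    """Hypothetical transcript segment with start/end times in seconds."""
    start: float
    end: float
    text: str


def weak_frame_labels(frame_times, captions, object_name, margin=0.0):
    """Return one noisy 0/1 label per frame timestamp.

    Assumption (illustrative only): a frame is positive if any caption
    containing `object_name` covers its timestamp, widened by `margin` seconds.
    """
    name = object_name.lower()
    # Time spans during which the narration mentions the object.
    spans = [
        (c.start - margin, c.end + margin)
        for c in captions
        if name in c.text.lower()
    ]
    return [int(any(lo <= t <= hi for lo, hi in spans)) for t in frame_times]


captions = [
    Caption(0.0, 4.0, "Today we look at the guitar"),
    Caption(4.0, 8.0, "First, tune the strings"),
]
labels = weak_frame_labels([1.0, 5.0, 7.5], captions, "guitar")
print(labels)
```

Labels produced this way are noisy in both directions: the narrator may mention an object that is off screen, or show it without naming it. That is the kind of noise the abstract says the method has to tolerate.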

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.11137/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1905.11137/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/1905.11137/full.md

---
Source: https://tomesphere.com/paper/1905.11137