PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Alex N. Wang; Christopher Hoang; Yuwen Xiong; Yann LeCun; Mengye Ren

arXiv:2408.11208·cs.CV·April 24, 2025

TL;DR

PooDLe is a self-supervised learning method for naturalistic videos. It combines an invariance-based objective on pooled representations with a dense objective, aiming to improve both spatial and semantic understanding of complex, real-world scenes.

Contribution

The paper presents a novel SSL method that integrates pooled and dense objectives at multiple feature scales for learning from naturalistic videos.

Findings

01 Effective representation learning from naturalistic videos

02 Improved spatial understanding through dense objectives

03 Enhanced semantic understanding via pooled representations

Abstract

Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose PooDLe, a self-supervised learning method that combines an invariance-based objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our results show that a unified objective applied at multiple feature scales is essential for learning effective image representations from naturalistic videos. We validate our method with experiments on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a…
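As a rough illustration of the two objectives the abstract describes, the sketch below pairs a pooled invariance term with a dense term that follows optical flow between two frames. It is not the authors' implementation. The cosine distance, the integer-valued flow, the single feature scale, the `dense_weight` parameter, and all function names are assumptions made for this toy example, and plain Python lists stand in for feature tensors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0

def global_avg_pool(fmap):
    """Average an H x W x C feature map (nested lists) into one C-vector."""
    cells = [vec for row in fmap for vec in row]
    return [sum(vals) / len(cells) for vals in zip(*cells)]

def pooled_loss(fmap_a, fmap_b):
    """Invariance term: pooled features of the two frames should agree."""
    return cosine_distance(global_avg_pool(fmap_a), global_avg_pool(fmap_b))

def dense_flow_loss(fmap_a, fmap_b, flow):
    """Dense term: the feature at (y, x) in frame a should match the feature
    at the flow-displaced location in frame b.

    flow[y][x] is an assumed integer (dy, dx) displacement from frame a to
    frame b. Locations that leave the frame are skipped.
    """
    h, w = len(fmap_a), len(fmap_a[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            ty, tx = y + dy, x + dx
            if 0 <= ty < h and 0 <= tx < w:
                total += cosine_distance(fmap_a[y][x], fmap_b[ty][tx])
                count += 1
    return total / count if count else 0.0

def combined_loss(fmap_a, fmap_b, flow, dense_weight=1.0):
    """Sum of the pooled and dense terms; dense_weight is a hypothetical knob."""
    return pooled_loss(fmap_a, fmap_b) + dense_weight * dense_flow_loss(fmap_a, fmap_b, flow)
```

In this toy setting the dense term goes to zero when frame b is frame a shifted by exactly the given flow, while the pooled term ignores spatial layout entirely. The abstract states that the method applies its objective at multiple feature scales, which this single-scale sketch does not attempt.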

Code & Models

Datasets

agentic-learning-ai-lab/Walking-Tours-Semantic
dataset · 43 dl

Videos

PooDLe🐩: Pooled and dense self-supervised learning from naturalistic videos · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Video Surveillance and Tracking Methods