GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

Shih-Fang Chen; Jun-Cheng Chen; I-Hong Jhuo; Yen-Yu Lin

arXiv:2602.14771 · cs.CV · March 19, 2026

TL;DR

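GOT-JEPA is a pretraining framework that extends JEPA from predicting image features to predicting tracking models. A student predictor learns, from a corrupted current frame, to reproduce the pseudo-tracking models that a teacher predictor generates from the clean frame, which improves robustness and occlusion handling in object tracking.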

Contribution

The paper presents GOT-JEPA, a novel model-predictive pretraining approach for object tracking that explicitly enhances occlusion reasoning and generalization to unseen scenarios.

Findings

01. Improves tracking robustness across seven benchmarks.

02. Enhances occlusion perception with OccuSolver.

03. Achieves better generalization to dynamic environments.

Abstract

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current…
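
The teacher/student setup described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the predictors are plain linear maps, a "tracking model" is a short weight vector, corruption is random feature dropping, the loss is mean squared error, and the teacher is updated as an exponential moving average of the student (a common JEPA-style choice that the abstract does not confirm). The names `predict`, `corrupt`, `student_step`, and `ema_update` are hypothetical.

```python
import random


def predict(weights, history, frame):
    """Hypothetical linear predictor: maps [history; frame] features to a tracking-model vector."""
    x = history + frame  # concatenate the two feature lists
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def corrupt(frame, drop_prob, rng):
    """Zero each feature with probability drop_prob (an assumed stand-in for occlusion)."""
    return [0.0 if rng.random() < drop_prob else v for v in frame]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def student_step(student, teacher, history, frame, rng, lr=0.05, drop_prob=0.3):
    """One gradient step pulling the student's prediction toward the teacher's.

    Both predictors receive the same history; only the current frame differs
    (clean for the teacher, corrupted for the student).
    """
    target = predict(teacher, history, frame)  # pseudo-tracking model, treated as a constant
    corrupted = corrupt(frame, drop_prob, rng)
    pred = predict(student, history, corrupted)
    x = history + corrupted
    n = len(pred)
    for i, row in enumerate(student):
        grad_out = 2.0 * (pred[i] - target[i]) / n  # d(MSE)/d(pred[i])
        for j in range(len(row)):
            row[j] -= lr * grad_out * x[j]  # in-place update of student weights
    return mse(pred, target)  # loss measured before the update


def ema_update(teacher, student, momentum=0.99):
    """Assumed teacher update: exponential moving average of the student weights."""
    for t_row, s_row in zip(teacher, student):
        for j in range(len(t_row)):
            t_row[j] = momentum * t_row[j] + (1.0 - momentum) * s_row[j]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim_hist, dim_frame, dim_model = 4, 4, 3  # toy sizes, chosen arbitrarily
    teacher = random_matrix(dim_model, dim_hist + dim_frame, rng)
    student = random_matrix(dim_model, dim_hist + dim_frame, rng)
    for step in range(200):
        history = [rng.uniform(-1.0, 1.0) for _ in range(dim_hist)]
        frame = [rng.uniform(-1.0, 1.0) for _ in range(dim_frame)]
        loss = student_step(student, teacher, history, frame, rng)
        ema_update(teacher, student)
    print("final loss:", loss)
```

The point of the sketch is the asymmetry of inputs: the student never sees the clean frame, so matching the teacher's output forces it to infer the tracking model despite missing evidence. The paper's actual predictors, corruption scheme, and losses may differ substantially from this toy version.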

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gaze Tracking and Assistive Technology · Human Pose and Action Recognition