GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

TL;DR
GOT-JEPA introduces a pretraining framework that predicts tracking models from visual data, improving robustness and occlusion handling in object tracking by learning from pseudo-supervision and detailed occlusion modeling.
Contribution
The paper presents GOT-JEPA, a novel model-predictive pretraining approach for object tracking that explicitly enhances occlusion reasoning and generalization to unseen scenarios.
Findings
Improves tracking robustness across seven benchmarks.
Enhances occlusion perception with OccuSolver.
Achieves better generalization to dynamic environments.
Abstract
The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Surveillance and Tracking Methods · Gaze Tracking and Assistive Technology · Human Pose and Action Recognition
