GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing
Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

TL;DR
GOT-Edit introduces a geometry-aware online model editing method that enhances generic object tracking by integrating 3D geometric cues with 2D semantics, improving robustness under occlusion and clutter.
Contribution
It presents a novel online cross-modality model editing approach that incorporates 3D geometric information into 2D video tracking using a pre-trained Visual Geometry Grounded Transformer.
Findings
Achieves superior robustness and accuracy on multiple GOT benchmarks.
Performs well under occlusion and clutter scenarios.
Demonstrates the effectiveness of combining 2D semantics with 3D geometric reasoning.
Abstract
Human perception for effective object tracking in 2D video streams arises from the implicit use of prior 3D knowledge and semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings, while neglecting 3D geometric cues, making them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To address the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing. By leveraging null-space constraints during model…
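The null-space constraint mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single linear layer `W`, a matrix `K0` of feature keys whose outputs should be preserved (for example, 2D semantic features), and a new key/value pair `k1`, `v1` to be written in (for example, a geometry-aware feature and its desired response). All of these names, the shapes, and the rank-one update rule are hypothetical choices for illustration. The idea shown is that projecting an update onto the null space of the preserved keys leaves the layer's outputs on those keys unchanged.

```python
import numpy as np

def null_space_projector(K0, rel_tol=1e-8):
    """Return P (d_in x d_in), the orthogonal projector onto the subspace
    orthogonal to every column of K0, so that P @ K0 is numerically zero."""
    cov = K0 @ K0.T                          # d_in x d_in, symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    threshold = rel_tol * max(eigvals.max(), 1.0)
    null_vecs = eigvecs[:, eigvals < threshold]
    return null_vecs @ null_vecs.T

# Hypothetical sizes: 8-dim inputs, 4-dim outputs, 3 preserved keys.
rng = np.random.default_rng(0)
d_in, d_out, n_keep = 8, 4, 3
W = rng.standard_normal((d_out, d_in))    # layer being edited
K0 = rng.standard_normal((d_in, n_keep))  # keys whose outputs must not change
k1 = rng.standard_normal(d_in)            # new key to write in
v1 = rng.standard_normal(d_out)           # desired output for the new key

P = null_space_projector(K0)

# Rank-one update restricted to the null space of K0 (an illustrative rule,
# not the closed-form solution used by AlphaEdit).
p = P @ k1                      # component of k1 invisible to the preserved keys
residual = v1 - W @ k1          # what the layer currently gets wrong on k1
delta = np.outer(residual, p) / (p @ p)
W_new = W + delta @ P

# Preserved keys map to the same outputs; the new key maps to its target.
print(np.allclose(W_new @ K0, W @ K0))
print(np.allclose(W_new @ k1, v1))
```

Because `P` is symmetric and idempotent, `delta @ P @ K0` vanishes while `delta @ P @ k1` equals the residual, which is why both checks hold. The sketch requires that `k1` is not entirely spanned by the columns of `K0`; otherwise `p` is zero and no null-space update can fit the new pair.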
Peer Reviews
Decision·ICLR 2026 Poster
The authors propose an innovative method for integrating 2D semantic information with 3D geometric information and validate it experimentally on datasets from diverse scenarios.
It is recommended to incorporate additional visualization results across diverse scenarios to substantiate the method's generality. While online model editing has achieved performance gains, it may introduce increased algorithmic complexity and computational overhead, potentially impacting real-time performance—particularly with high-resolution video inputs. Furthermore, the section detailing the online model editing methodology lacks detailed formulaic steps for operations based on AlphaEdit, w
1. Introducing 3D geometric cues into online model editing for object tracking is both innovative and practical, providing a new perspective for improving robustness against shape and viewpoint variations.
2. The paper is well-written, logically structured, and easy to follow.
3. Demonstrates improvements on several benchmarks and provides qualitative examples showing better model adaptability during long-term tracking.
1. Since online model editing is computationally non-trivial, reporting runtime comparisons with baselines would help clarify practical deployment feasibility.
1. The introduction of a geometry-aware correspondence learning mechanism is interesting and effective in visual tracking.
2. The proposed approach has been evaluated on multiple single-object tracking (SOT) benchmarks and demonstrates competitive performance.
- Compared with the baseline, VGGT introduces additional computational overhead. It is recommended that the authors include a speed and FLOPs comparison to better illustrate efficiency.
- The VGGT component continuously updates learned knowledge during training. How does the method address potential error accumulation in this iterative learning process?
- While obtaining geometric features directly from 2D data without using 3D inputs can reduce data collection costs, it raises concerns about wh
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
