GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Shih-Fang Chen; Jun-Cheng Chen; I-Hong Jhuo; Yen-Yu Lin

arXiv:2602.08550·cs.CV·February 25, 2026

Open Access · 3 Reviews

TL;DR

GOT-Edit introduces a geometry-aware online model editing method that enhances generic object tracking by integrating 3D geometric cues with 2D semantics, improving robustness under occlusion and clutter.

Contribution

It presents a novel online cross-modality model editing approach that incorporates 3D geometric information into 2D video tracking using a pre-trained Visual Geometry Grounded Transformer.

Findings

01 Achieves superior robustness and accuracy on multiple GOT benchmarks.

02 Performs well under occlusion and clutter scenarios.

03 Demonstrates the effectiveness of combining 2D semantics with 3D geometric reasoning.

Abstract

Human perception for effective object tracking in 2D video streams arises from the implicit use of prior 3D knowledge and semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings, while neglecting 3D geometric cues, making them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To address the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing. By leveraging null-space constraints during model…
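The null-space constraint mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation: it only shows the general idea, used in null-space-constrained editing methods such as AlphaEdit (which a reviewer below names), of projecting a weight update so that outputs for a set of preserved inputs do not change. All names here (`null_space_projector`, `edit_weights`, `preserved_keys`) are hypothetical, the layer is assumed to be a plain linear map `W k`, and pure-Python lists stand in for tensors.

```python
# Illustrative sketch only: a generic null-space-projected weight edit.
# Assumption: the edited layer is linear (output = W @ k); names are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def null_space_projector(preserved_keys):
    """P = I - Q Q^T, so P k = 0 for every k in the span of the preserved keys."""
    d = len(preserved_keys[0])
    basis = orthonormal_basis(preserved_keys)
    return [[(1.0 if i == j else 0.0) - sum(q[i] * q[j] for q in basis)
             for j in range(d)] for i in range(d)]

def edit_weights(w, delta, preserved_keys):
    """Return W + delta @ P: the update cannot alter outputs for preserved keys."""
    projected = matmul(delta, null_space_projector(preserved_keys))
    return [[a + b for a, b in zip(w_row, p_row)]
            for w_row, p_row in zip(w, projected)]
```

With `preserved_keys` spanning the first two axes of a 3-dimensional input, any proposed update is restricted to the third axis: `matvec` gives the same result before and after the edit for each preserved key, while inputs with a component along the third axis see the new behaviour. How GOT-Edit builds its preserved set and its update is not recoverable from the truncated abstract.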

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The authors propose an innovative method for integrating 2D semantic information with 3D geometric information, and conduct experimental validation using datasets from diverse scenarios.

Weaknesses

It is recommended to incorporate additional visualization results across diverse scenarios to substantiate the method's generality. While online model editing has achieved performance gains, it may introduce increased algorithmic complexity and computational overhead, potentially impacting real-time performance—particularly with high-resolution video inputs. Furthermore, the section detailing the online model editing methodology lacks detailed formulaic steps for operations based on AlphaEdit, w

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. Introducing 3D geometric cues into online model editing for object tracking is both innovative and practical, providing a new perspective for improving robustness against shape and viewpoint variations.
2. The paper is well-written, logically structured, and easy to follow.
3. Demonstrates improvements on several benchmarks and provides qualitative examples showing better model adaptability during long-term tracking.

Weaknesses

1. Since online model editing is computationally non-trivial, reporting runtime comparisons with baselines would help clarify practical deployment feasibility.

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The introduction of a geometry-aware correspondence learning mechanism is interesting and effective in visual tracking.
2. The proposed approach has been evaluated on multiple state-of-the-art (SOT) benchmarks and demonstrates competitive performance.

Weaknesses

- Compared with the baseline, VGGT introduces additional computational overhead. It is recommended that the authors include a speed and FLOPs comparison to better illustrate efficiency.
- The VGGT component continuously updates learned knowledge during training. How does the method address potential error accumulation in this iterative learning process?
- While obtaining geometric features directly from 2D data without using 3D inputs can reduce data collection costs, it raises concerns about wh

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging