Depth-Adaptive Computational Policies for Efficient Visual Tracking
Chris Ying, Katerina Fragkiadaki

TL;DR
This paper introduces a depth-adaptive convolutional Siamese network for video object tracking that dynamically adjusts computation depth, balancing accuracy and efficiency, and outperforming fixed-structure networks in cost-accuracy trade-offs.
Contribution
It proposes a novel depth-adaptive neural network with parametric gating for efficient video tracking, enabling dynamic computation depth control based on scene complexity.
Findings
Achieves accuracy comparable to the state of the art on the VOT2016 benchmark.
Achieves higher accuracy for a given computational cost than traditional fixed-structure networks.
The framework extends to other CNN-based tasks, allowing speed to be traded for accuracy at runtime.
Abstract
Current convolutional neural network algorithms for video object tracking spend the same amount of computation on each object and video frame. However, it is harder to track an object in some frames than in others, due to the varying amount of clutter, scene complexity, amount of motion, and the object's distinctiveness against its background. We propose a depth-adaptive convolutional Siamese network that performs video tracking adaptively at multiple neural network depths. Parametric gating functions are trained to control the depth of the convolutional feature extractor by minimizing a joint loss of computational cost and tracking error. Our network achieves accuracy comparable to the state-of-the-art on the VOT2016 benchmark. Furthermore, our adaptive depth computation achieves higher accuracy for a given computational cost than traditional fixed-structure neural networks. The presented…
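The gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the soft "stop probability" interpretation of each gate, the 0.5 threshold, and the weighting factor `lam` are all assumptions chosen for illustration. The sketch shows one plausible way to combine tracking error and computational cost into a single expected loss during training, and to pick a depth with hard thresholding at inference.

```python
from typing import List, Sequence


def halting_distribution(gate_probs: Sequence[float]) -> List[float]:
    """Turn per-depth stop probabilities into a distribution over depths.

    Assumption: gate i outputs the probability of stopping at depth i,
    given that the network did not stop earlier. The deepest layer absorbs
    all remaining probability mass, so its own gate value is ignored.
    """
    dist = []
    remaining = 1.0  # probability of having reached the current depth
    for g in gate_probs[:-1]:
        dist.append(remaining * g)
        remaining *= 1.0 - g
    dist.append(remaining)
    return dist


def joint_loss(
    tracking_errors: Sequence[float],
    gate_probs: Sequence[float],
    layer_costs: Sequence[float],
    lam: float,
) -> float:
    """Expected tracking error plus lam times expected cumulative compute cost.

    `lam` is a hypothetical trade-off weight; larger values favor stopping early.
    """
    dist = halting_distribution(gate_probs)
    loss = 0.0
    cumulative_cost = 0.0
    for p, err, cost in zip(dist, tracking_errors, layer_costs):
        cumulative_cost += cost  # stopping at depth i pays for layers 0..i
        loss += p * (err + lam * cumulative_cost)
    return loss


def chosen_depth(gate_probs: Sequence[float], threshold: float = 0.5) -> int:
    """Hard decision at inference: stop at the first gate above the threshold."""
    for depth, g in enumerate(gate_probs[:-1]):
        if g >= threshold:
            return depth
    return len(gate_probs) - 1  # fall through to the deepest layer


if __name__ == "__main__":
    gates = [0.5, 0.5, 0.0]
    errors = [1.0, 0.5, 0.25]  # deeper layers track more accurately
    costs = [1.0, 1.0, 1.0]    # each layer costs one unit of compute
    print(halting_distribution(gates))
    print(joint_loss(errors, gates, costs, lam=0.1))
    print(chosen_depth([0.2, 0.7, 0.0]))
```

In this toy setup, an easy frame would produce a confident early gate and stop at a shallow depth, while a cluttered frame would fall through to deeper layers and pay a higher compute cost.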
Taxonomy
Topics: Video Surveillance and Tracking Methods · Image Enhancement Techniques · Advanced Vision and Imaging
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Siamese Network
