TL;DR
This paper introduces a neural network model based on phase synchrony that mimics the human ability to track objects despite appearance changes and outperforms other deep learning models in a controlled tracking challenge.
Contribution
Inspired by neuroscience theories, the paper presents a novel complex-valued recurrent neural network (CV-RNN) that uses neural synchrony to track objects independently of their appearance.
Findings
CV-RNN closely mimics human object tracking performance.
State-of-the-art DNNs struggle with appearance-changing object tracking.
Neural synchrony may serve as a neural substrate for robust object tracking.
Abstract
Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or the movement of non-rigid objects can drastically alter available image features. How do biological visual systems track objects as they change? One plausible mechanism involves attentional mechanisms for reasoning about the locations of objects independently of their appearances -- a capability that prominent neuroscience theories have associated with computing through neural synchrony. Here, we describe a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks…
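The synchrony idea in the abstract can be illustrated with a toy sketch. This is not the CV-RNN, whose architecture is not described in this summary; it only assumes, for illustration, that each unit is a complex number whose magnitude carries feature strength (appearance) and whose phase carries object membership. The names `phase_coherence` and `make_units` and all numeric values are hypothetical.

```python
import numpy as np


def make_units(magnitudes, phases):
    """Build complex activations from magnitudes (appearance) and phases (grouping)."""
    return magnitudes * np.exp(1j * phases)


def phase_coherence(z):
    """Mean resultant length of the phases of z: 0 = scattered, 1 = fully synchronized."""
    return float(np.abs(np.mean(np.exp(1j * np.angle(z)))))


rng = np.random.default_rng(0)

# Hypothetical population: 50 units bound to object A share a phase near 0.3 rad,
# 50 units bound to object B share the opposite phase (0.3 + pi).
phases = np.concatenate([
    0.3 + 0.05 * rng.standard_normal(50),
    0.3 + np.pi + 0.05 * rng.standard_normal(50),
])
magnitudes = rng.uniform(0.5, 2.0, size=100)
z = make_units(magnitudes, phases)

coh_a = phase_coherence(z[:50])    # within object A: close to 1
coh_b = phase_coherence(z[50:])    # within object B: close to 1
coh_all = phase_coherence(z)       # across both objects: close to 0

# Simulate an appearance change: new magnitudes, same phases.
z_changed = make_units(rng.uniform(0.5, 2.0, size=100), phases)
coh_a_changed = phase_coherence(z_changed[:50])

print(coh_a, coh_b, coh_all, coh_a_changed)
```

In this toy setting, units belonging to the same object stay synchronized even when every magnitude is resampled, which is the sense in which phase can carry object identity independently of appearance.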
Peer Reviews
Decision · ICLR 2025 Poster
1. The design of CV-RNN is well-motivated by prominent neuroscience theories (e.g., binding-by-synchrony) and serves as a proof-of-concept that neural synchrony aids in object tracking.
2. The combination of the FeatureTracker challenge and human psychophysics experiments provides a valuable framework for studying object tracking in a controlled environment, emphasizing the significant performance gap between current deep learning models and human capabilities.
3. CV-RNN outperforms other baseli
1. Although CV-RNN approaches human performance on the FeatureTracker challenge, the use of synthetic datasets featuring simplified changes in appearance and shape may limit the generalizability of the findings to real-world object tracking, which is more challenging. Using naturalistic videos such as DAVIS would significantly strengthen the paper’s claims.
2. Comparing with self-supervised visual representation learning methods such as VideoMAE and DINO would be useful. The baseline methods in
I think the question of object tracking is interesting. The benchmark is interestingly conceived.
The evaluations are very simple and don't really support strong claims about the new architecture being much better than previous architectures. The test of "standard" DNNs for the purpose of baselines is perhaps a little shaky. The benchmark, while interesting, is limited by being in such a toy condition. I would be much more convinced if the algorithms here were also tested on recent object tracking benchmarks such as TAP-VID. (Or if it was convincingly explained why such benchmarks
1. The paper establishes a new dataset and challenge for object tracking with changing object appearance. The setup to generate this synthetic data allows customization for specific research questions related to object tracking and ablation studies.
2. The authors present a new architecture and establish convincing experiments and results for why the new model (and training technique with specific loss function) brings the model mechanism closer to what is observed in neuroscience in terms of ne
1. The training and test data consist of synthetic single-color shapes, so it is unclear how well the results translate to real videos and objects (the authors leave this point for future work). An interesting intermediate step toward a more natural setting could have been textured shapes and/or an (ideally non-static) background image.
2. An interesting model to try besides the presented baselines would have been other complex-valued RNNs that are not bio-inspired but compatible with the introduced
Taxonomy
Methods: Softmax · Attention Is All You Need
