Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation
Xuening Zhang, Qi Lv, Xiang Deng, Miao Zhang, Xingbo Liu, Liqiang Nie

TL;DR
Cortical Policy introduces a dual-stream view transformer inspired by human brain mechanisms, combining static and dynamic visual reasoning to improve robotic manipulation in complex and dynamic environments.
Contribution
The paper proposes a novel dual-stream view transformer that jointly reasons from static and dynamic views, enhancing spatial understanding and adaptability in robotic manipulation.
Findings
Outperforms state-of-the-art baselines on RLBench and COLOSSEUM benchmarks.
Demonstrates effective handling of spatially complex and dynamic tasks.
Validates the superiority of dual-stream design for visuomotor control.
Abstract
View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views to address these challenges, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human…
Peer Reviews
Decision: ICLR 2026 Poster
Clear motivation to enforce 3D consistency and fuse dynamic cues. Useful ablations: removing the geometric loss drops performance; end-to-end fine-tuning of the gaze model underperforms freezing it; and heatmaps matter for the dynamic stream.
The framing of Cortical Policy is unnecessarily complicated. My understanding is that it produces a saliency map around the end-effector position to provide an inductive bias. I am unsure whether fine-tuning from a gaze model is needed. We could instead extract the end-effector location from the robot's forward kinematics and register it on the camera images, which seems to be an easy baseline that may perform similarly.
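The baseline suggested in this review can be sketched in a few lines. This is a minimal illustration, not anything taken from the paper: it assumes the end-effector position is already available in world coordinates (for example from forward kinematics), that the camera follows a pinhole model with known intrinsics and world-to-camera extrinsics, and that a Gaussian blob is an acceptable stand-in for a saliency map. The function names `project_point` and `gaussian_heatmap` and all parameters are hypothetical.

```python
import math


def project_point(p_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    Assumption: R (3x3 nested lists) and t (length 3) map world coordinates
    to camera coordinates, p_cam = R @ p_world + t.
    Returns None when the point lies behind the camera.
    """
    p_cam = [sum(R[i][k] * p_world[k] for k in range(3)) + t[i] for i in range(3)]
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def gaussian_heatmap(center, width, height, sigma=2.0):
    """Build a height x width heatmap with a Gaussian peak at `center` (u, v)."""
    u0, v0 = center
    two_sigma_sq = 2.0 * sigma ** 2
    return [
        [math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / two_sigma_sq) for u in range(width)]
        for v in range(height)
    ]


# Hypothetical usage: identity extrinsics, end effector 1 m in front of the camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([0.0, 0.0, 1.0], identity, [0.0, 0.0, 0.0],
                      fx=100.0, fy=100.0, cx=32.0, cy=24.0)
heatmap = gaussian_heatmap(pixel, width=64, height=48)
```

Whether such a kinematics-derived heatmap would match the learned gaze model is exactly the open question the reviewer raises; the sketch only shows that the baseline itself is cheap to construct.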
I believe that this paper studies a relevant and timely problem (imitation learning from multiple RGB cameras, and in this case a combination of static and dynamic views), and is likely to be of interest to the community. The problem is clearly defined and I believe that the shortcomings of prior work are described in enough detail for an unfamiliar reader to appreciate the technical contributions. The paper is generally well written and easy to follow throughout, although the method section is
My initial assessment of the paper is fairly neutral. I believe that the paper and contributions are interesting, but I also do have some concerns that I would like the authors to address: - Since this paper appears to follow the PerAct experimental setup, I was a bit surprised to not see PerAct listed as a baseline. While I understand that this work focuses on implicit view fusion rather than the explicit 3D representation of PerAct, I do believe that the comparison would be useful to readers
1. The paper specifies supervision generation, the SmoothAP-based cyclic consistency loss $L_{cgc}$, the dynamic rendering/pretraining pipeline, and shows ablations isolating architecture, pretraining, and heatmaps.
2. On RLBench, the authors claim higher average success than RVT-2 and improved performance on spatial-reasoning and dynamic scenarios; they also include small-scale real-robot tests.
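For readers unfamiliar with the SmoothAP ingredient mentioned in this review, the following is a minimal sketch of the generic Smooth-AP objective, which replaces the hard ranking indicator in average precision with a sigmoid of temperature `tau`. It is not the paper's $L_{cgc}$: how scores and positives are built from cross-view keypoints, and how the cyclic consistency is formed, are not described in this summary. The names `smooth_ap` and `smooth_ap_loss` and the default `tau=0.01` are assumptions for illustration.

```python
import math


def _sigmoid(x, tau):
    """Numerically stable sigmoid of x / tau."""
    z = x / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_ap(scores, is_positive, tau=0.01):
    """Smoothed average precision for one query.

    scores[i] is the similarity of candidate i to the query and
    is_positive[i] marks the correct matches. The rank of each positive
    is approximated by 1 + sum_j sigmoid((s_j - s_i) / tau).
    """
    positives = [i for i, flag in enumerate(is_positive) if flag]
    if not positives:
        raise ValueError("at least one positive candidate is required")
    total = 0.0
    for i in positives:
        # Soft rank of candidate i among the positives only.
        rank_pos = 1.0 + sum(
            _sigmoid(scores[j] - scores[i], tau) for j in positives if j != i
        )
        # Soft rank of candidate i among all candidates.
        rank_all = 1.0 + sum(
            _sigmoid(scores[j] - scores[i], tau)
            for j in range(len(scores)) if j != i
        )
        total += rank_pos / rank_all
    return total / len(positives)


def smooth_ap_loss(scores, is_positive, tau=0.01):
    """Loss form: minimizing 1 - SmoothAP pushes positives up the ranking."""
    return 1.0 - smooth_ap(scores, is_positive, tau)
```

A smaller `tau` makes the approximation closer to true average precision but yields sparser gradients in an autodiff setting; this pure-Python version only evaluates the value and would need a tensor library to be used for training.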
1. Despite an elaborate pipeline, the trajectory-learning advantage may be modest. Even in the authors’ table, some tasks see limited gains or regressions, raising the question of whether the architectural complexity is justified by the overall deltas.
2. It’s not yet conclusive that the dorsal (dynamic) stream is the key driver of improvement; ablations show mixed patterns, and the net gain over a strong static baseline can be small.
Taxonomy
Topics: Action Observation and Synchronization · Motor Control and Adaptation · EEG and Brain-Computer Interfaces
