Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation

Xuening Zhang; Qi Lv; Xiang Deng; Miao Zhang; Xingbo Liu; Liqiang Nie

arXiv:2603.21051·cs.RO·March 24, 2026

Open Access 3 Reviews

TL;DR

Cortical Policy introduces a dual-stream view transformer inspired by human brain mechanisms, combining static and dynamic visual reasoning to improve robotic manipulation in complex and dynamic environments.

Contribution

The paper proposes a novel dual-stream view transformer that jointly reasons from static and dynamic views, enhancing spatial understanding and adaptability in robotic manipulation.

Findings

01. Outperforms state-of-the-art baselines on RLBench and COLOSSEUM benchmarks.

02. Demonstrates effective handling of spatially complex and dynamic tasks.

03. Validates the superiority of dual-stream design for visuomotor control.

Abstract

View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views to address these challenges, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

Clear motivation to force 3D consistency and fuse dynamic cues. Useful ablations: removing the geometric loss drops performance; end‑to‑end fine‑tuning the gaze model underperforms freezing; and heatmaps matter for the dynamic stream.

Weaknesses

The framing of Cortical Policy is unnecessarily complicated. My understanding is that it produces a saliency map around the end-effector position to provide an inductive bias. I am unsure whether fine-tuning from a gaze model is needed. We could also just extract the end-effector location from the robot's forward kinematics and register it on the camera images, which seems to be an easy baseline that may perform similarly.
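The baseline this reviewer suggests can be sketched in a few lines. The sketch below is only an illustration of that suggestion, not the paper's method: it assumes the end-effector position in the world frame is already available from forward kinematics, that the camera follows a pinhole model with known intrinsics and extrinsics, and that a Gaussian blob is an acceptable stand-in for a saliency map. The function names `project_point` and `gaussian_heatmap` and all numeric values are hypothetical.

```python
import math


def project_point(p_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    Assumption: R (3x3 nested lists) and t (length 3) map world to camera
    coordinates as p_cam = R @ p_world + t. Returns None when the point
    lies behind the camera.
    """
    p_cam = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def gaussian_heatmap(u, v, width, height, sigma=2.0):
    """Render a height x width heatmap with a Gaussian peak at pixel (u, v)."""
    return [
        [
            math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2.0 * sigma ** 2))
            for col in range(width)
        ]
        for row in range(height)
    ]


# Hypothetical setup: camera frame coincides with the world frame.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_zero = [0.0, 0.0, 0.0]
# Placeholder for the forward-kinematics output (metres, world frame).
ee_position = [0.25, -0.125, 1.25]

pixel = project_point(ee_position, R_identity, t_zero, fx=100.0, fy=100.0, cx=32.0, cy=24.0)
heatmap = gaussian_heatmap(pixel[0], pixel[1], width=64, height=48)
```

With these placeholder values the end effector lands at pixel (52, 14), and the heatmap peaks there. Whether such a kinematics-derived map matches a learned gaze model is exactly the open question the reviewer raises.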

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- I believe that this paper studies a relevant and timely problem (imitation learning from multiple RGB cameras, and in this case a combination of static and dynamic views), and is likely to be of interest to the community. The problem is clearly defined and I believe that the shortcomings of prior work are described in enough detail for an unfamiliar reader to appreciate the technical contributions. The paper is generally well written and easy to follow throughout, although the method section is

Weaknesses

My initial assessment of the paper is fairly neutral. I believe that the paper and contributions are interesting, but I also do have some concerns that I would like the authors to address:

- Since this paper appears to follow the PerAct experimental setup, I was a bit surprised to not see PerAct listed as a baseline. While I understand that this work focuses on implicit view fusion rather than the explicit 3D representation of PerAct, I do believe that the comparison would be useful to readers

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper specifies supervision generation, the SmoothAP-based cyclic consistency loss $L_{cgc}$, the dynamic rendering/pretraining pipeline, and shows ablations isolating architecture, pretraining, and heatmaps.
2. On RLBench, the authors claim higher average success than RVT-2 and improved performance on spatial-reasoning and dynamic scenarios; they also include small-scale real-robot tests.
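For readers unfamiliar with the SmoothAP surrogate mentioned above, the following is a minimal sketch of the generic Smooth-AP idea: the ranking indicator in average precision is replaced by a temperature-scaled sigmoid of score differences, which makes the metric differentiable. It does not reproduce the paper's cyclic consistency loss $L_{cgc}$, whose exact form is not given on this page. The function name `smooth_ap_loss`, the temperature `tau`, and the example scores are assumptions made for illustration.

```python
import math


def _sigmoid(x, tau):
    """Numerically stable sigmoid of x / tau."""
    z = x / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_ap_loss(scores, is_positive, tau=0.01):
    """Return 1 - (smoothed average precision) for one query.

    scores: similarity of each candidate to the query (higher is better).
    is_positive: booleans marking which candidates are true matches.
    The rank of item i is approximated as 1 + sum_j sigmoid((s_j - s_i) / tau).
    """
    positives = [i for i, flag in enumerate(is_positive) if flag]
    if not positives:
        raise ValueError("at least one positive candidate is required")
    total = 0.0
    for i in positives:
        # Smoothed rank of i among the positives only.
        rank_pos = 1.0 + sum(
            _sigmoid(scores[j] - scores[i], tau) for j in positives if j != i
        )
        # Smoothed rank of i among all candidates.
        rank_all = 1.0 + sum(
            _sigmoid(scores[j] - scores[i], tau) for j in range(len(scores)) if j != i
        )
        total += rank_pos / rank_all
    return 1.0 - total / len(positives)


# Hypothetical scores: positives ranked first versus positives ranked last.
labels = [True, True, False, False]
loss_good = smooth_ap_loss([0.9, 0.8, 0.1, 0.2], labels)
loss_bad = smooth_ap_loss([0.1, 0.2, 0.9, 0.8], labels)
```

The loss is close to zero when the positives are ranked first and grows as negatives overtake them. A smaller `tau` tracks true average precision more tightly at the cost of sharper gradients.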

Weaknesses

1. Despite an elaborate pipeline, the trajectory-learning advantage may be modest. Even in the authors’ table, some tasks see limited gains or regressions, raising the question of whether the architectural complexity is justified by the overall deltas.
2. It’s not yet conclusive that the dorsal (dynamic) stream is the key driver of improvement; ablations show mixed patterns, and the net gain over a strong static baseline can be small.


Taxonomy

Topics: Action Observation and Synchronization · Motor Control and Adaptation · EEG and Brain-Computer Interfaces