Correspondence-Oriented Imitation Learning: Flexible Visuomotor Control with 3D Conditioning
Yunhao Cao, Zubin Bhaumik, Jessie Jia, Xingyi He, Kuan Fang

TL;DR
COIL introduces a flexible, correspondence-based visuomotor control framework that adapts to various task granularities using a spatio-temporal attention policy trained with self-supervised learning, enabling robust real-world manipulation.
Contribution
The paper proposes a novel correspondence-oriented task representation with variable granularity and a scalable self-supervised training pipeline for flexible visuomotor control.
Findings
Outperforms prior methods on real-world manipulation tasks.
Supports variable spatial and temporal task specifications.
Generalizes across different objects and motion patterns.
Abstract
We introduce Correspondence-Oriented Imitation Learning (COIL), a conditional policy learning framework for visuomotor control with a flexible task representation in 3D. At the core of our approach, each task is defined by the intended motion of keypoints selected on objects in the scene. Instead of assuming a fixed number of keypoints or uniformly spaced time intervals, COIL supports task specifications with variable spatial and temporal granularity, adapting to different user intents and task requirements. To robustly ground this correspondence-oriented task representation into actions, we design a conditional policy with a spatio-temporal attention mechanism that effectively fuses information across multiple input modalities. The policy is trained via a scalable self-supervised pipeline using demonstrations collected in simulation, with correspondence labels automatically generated…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Flexible task interface: Clean formulation of 3D correspondence specs that support variable K and H, bridging between sparse start–goal and dense flows within one interface. This is clearly articulated and practically useful.
2. Architectural design: The interleaved temporal-self, spatial-self, and point-cloud cross-attention is a thoughtful way to ground sparse plans into precise actions; the role of normalized temporal P.E. is well motivated.
3. Scalable training recipe: Hindsight correspo
1. Baseline coverage/fairness: While 2D-flow and end-effector baselines are included, comparisons omit strong 3D point-cloud policies (e.g., recent 3D diffusion/point-conditioned policies also cited by the paper), which would better isolate the benefit of the correspondence interface vs. modern 3D observation encoders. Additionally, General-Flow is only applicable in the dense setting and RT-Trajectory is retrained on the authors’ data, leaving open questions of tuning parity.
2. Reliance on ext
- The paper attempts to address an important problem of task specification for conditioning robot policies.
- The authors compare the proposed method on 3 real world tasks across sparse, medium, and dense task specification settings. COIL outperforms baselines in all these settings.
- The paper includes an ablation study to justify the design choices in COIL.
- The paper also includes qualitative results which give the reader a better understanding of the working of the method and its failure mo
Including both weaknesses as well as questions tied to the weaknesses below.
- How are the correspondence representations specified during training and at inference? How are the keypoints on each object determined and how is the trajectory of keypoints obtained? Once keypoints on each object are detected in the first frame, I believe they can be tracked across the trajectory to get the keypoints for the whole trajectory in the training data. However, how are they obtained during inference when t
* Extending 2D task representations to 3D is a logical step with clear potential benefits in terms of representational power.
* While the proposed representation requires specifying the desired trajectory of 3D physical points, it does so in an exceedingly prescriptive manner. Because tracks are sparse and timing other than ordering is not specified, this leaves the model and task the freedom to determine a particular sequence of actions and timing. This makes the method more applicable compare
* The paper could use a pass to make the math more precise and clear. For example, at line 190 the policy is a function of $o_{0:t},c$ and at line 213 as a function of $f(x_t, \rho_t, c_{t:H})$. What is $\rho$? Is that supposed to be $u$? Why is $c$ indexed in that manner? Is there any other proprioception information such as joint angles?
* If I understand the mechanism for generating training data in Section 3.3, the idea is to simulate first episodes by randomising the robot actions, and the
- The paper makes important extensions to correspondence-based task specifications by
  - removing rigid assumptions about fixed keypoint counts and uniformly spaced temporal intervals, enabling specifications with variable spatial and temporal granularity.
  - requiring only that target coordinates be reached in sequential order rather than at predetermined timesteps, allowing the policy to dynamically adapt execution speed and exhibit recovery behaviors online.
- The experiments are mainly do
- While Tab.1 demonstrates that COIL can execute tasks conditioned on correspondence specifications of varying granularity, most evaluated methods ***require correspondence inputs to be provided externally***. Since real-world deployment typically begins with language instructions rather than ground-truth correspondences, the paper should more thoroughly evaluate end-to-end performance with automatic correspondence generation. The current evaluation focuses primarily on execution given correspon
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Human Pose and Action Recognition
