Human-oriented Representation Learning for Robotic Manipulation
Mingxiao Huo, Mingyu Ding, Chenfeng Xu, Thomas Tian, Xinghao Zhu, Yao Mu, Lingfeng Sun, Masayoshi Tomizuka, Wei Zhan

TL;DR
This paper proposes a human-oriented multi-task fine-tuning approach with a Task Fusion Decoder to enhance visual representations for robotic manipulation, outperforming existing methods in simulation and real-world tasks.
Contribution
It introduces a novel multi-task fine-tuning framework with a Task Fusion Decoder that leverages perceptual skills to improve robotic manipulation representations.
Findings
Consistent improvement across various robotic tasks and embodiments.
Enhanced representations for state-of-the-art visual encoders like R3M, MVP, and EgoVLP.
Effective in both simulation and real-world environments.
Abstract
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited for learning robot manipulation policies compared to current state-of-the-art visual representations purely based on self-supervised objectives. We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the…
Peer Reviews
Decision · Submitted to ICLR 2024
This work addresses the important question of how to meaningfully learn general-purpose visual representations for manual interaction. A central idea is to leave behind conventional self-supervised training and leverage human examples of manipulation, and train in a supervised fashion on state transitions, which are arguably crucial aspects of any manipulation. The system leverages an existing, labeled dataset (Ego4D), but, thanks to training on surrogate tasks, generalizes to a broad class of m
The intuition of the paper and the grand lines are largely clear, but I have spent considerable time trying to understand crucial details. The following statements describe my conclusions, referring to R3M, MVP, and EgoVLP as three possible "backbones" as the paper does:
- The encoder part of a pre-trained backbone is connected as input to the TFD.
- For training, the system is trained end-to-end on each of the three surrogate tasks simultaneously, training the weights of the TFD from scratch whi
This paper tackles an important problem: how to adapt large pre-trained models to be useful for downstream tasks such as robotics. This is an increasingly important area as foundation models become more common, but such models can be too expensive to train for many research groups. Therefore, having methods that successfully adapt these models with less computationally expensive training is important. The main finding, that fine-tuning video models on specific tasks relating to hand / object int
There are two weaknesses to the paper that I believe can be addressed in rebuttals. First, while the body of the paper is well written and easy to follow, the introduction is inappropriate given the results shown. The claimed message, that the presented fine-tuning represents a method for human alignment without requiring human labels, is not supported by the paper’s results and experiments. What the paper shows is that fine-tuning video models on hand-object interaction tasks improves the repr
* The proposed approach looks to be very flexible, and easily added to existing backbones.
* The proposed approach introduces a useful means of exploiting more readily available data of humans performing tasks in robot manipulation tasks where data is typically not as easily obtained, without the need for direct correspondences between human and robot actions or behaviours.
My primary concerns with the paper are around clarity, and the motivation for the choice of auxiliary task decoder objectives. Major: As I understand it, the proposed approach learns to decode three aspects related to embodied tasks from video/image representations (object state change classification, point-of-no-return temporal localisation, and state change object detection), using self-attention to learn relationships between tasks and cross-attention to decode tasks. However, the paper does not
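The review excerpts above describe the Task Fusion Decoder as taking the backbone encoder's output, using self-attention to relate three perceptual tasks to one another, and using cross-attention to decode each task. The following is only a rough sketch of that data flow, not the authors' implementation. It assumes one token per task, single-head attention with no learned projections, and no layer normalization, feed-forward layers, or task heads. The task names, sizes, and random inputs are hypothetical stand-ins for learned task tokens and real backbone features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def residual(x, update):
    """Element-wise residual connection."""
    return [[a + b for a, b in zip(xr, ur)] for xr, ur in zip(x, update)]


def task_fusion_sketch(task_tokens, encoder_feats):
    """Fuse task tokens with each other, then let them read encoder features."""
    # Step 1: self-attention lets the task tokens exchange information.
    fused = residual(task_tokens,
                     attention(task_tokens, task_tokens, task_tokens))
    # Step 2: cross-attention, where task tokens query the backbone features.
    return residual(fused, attention(fused, encoder_feats, encoder_feats))


# Hypothetical task names and sizes; random values stand in for learned
# task tokens and for the output of a pre-trained encoder.
TASKS = ["state_change_classification", "point_of_no_return",
         "state_change_detection"]
DIM, NUM_PATCHES = 8, 5
rng = random.Random(0)
task_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in TASKS]
encoder_feats = [[rng.gauss(0, 1) for _ in range(DIM)]
                 for _ in range(NUM_PATCHES)]

decoded = task_fusion_sketch(task_tokens, encoder_feats)
for name, vec in zip(TASKS, decoded):
    print(name, [round(v, 3) for v in vec[:3]])
```

In this sketch each row of `decoded` is one task-specific embedding. In a trained system each such embedding would presumably feed a task-specific prediction head and loss, which are omitted here.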
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
