MotIF: Motion Instruction Fine-tuning
Minyoung Hwang, Joey Hejna, Dorsa Sadigh, Yonatan Bisk

TL;DR
MotIF fine-tunes vision-language models with abstract trajectory representations to improve robotic motion success detection, enabling better understanding of full motion trajectories for various tasks.
Contribution
The paper introduces MotIF, a novel fine-tuning method using abstract motion representations, and presents the MotIF-1K dataset for benchmarking robotic motion understanding.
Findings
MotIF achieves at least twice the precision of state-of-the-art VLMs.
MotIF achieves 56.1% higher recall in success detection.
The model generalizes well across unseen motions, tasks, and environments.
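The findings above are stated in terms of precision and recall for binary success detection. As a minimal sketch of how those two metrics are computed, the snippet below uses a hypothetical helper `precision_recall` and made-up labels; neither the function name nor the data comes from the paper.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Compute precision and recall for binary success detection.

    predicted[i] is the detector's verdict for trajectory i (True = success),
    actual[i] is the ground-truth label. Hypothetical helper for illustration.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)

    # Precision: of the trajectories flagged as successful, how many were.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: of the truly successful trajectories, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Illustrative labels only (not results from the paper).
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
precision, recall = precision_recall(predicted, actual)
```

A detector with low precision marks failed motions as successes, while one with low recall misses motions that did succeed, so the two findings describe complementary kinds of error.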
Abstract
While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able…
Taxonomy
Topics: Advanced Vision and Imaging · Optical measurement and interference techniques
Methods: ALIGN
