MotIF: Motion Instruction Fine-tuning

Minyoung Hwang; Joey Hejna; Dorsa Sadigh; Yonatan Bisk

arXiv:2409.10683 · cs.RO · November 19, 2025

TL;DR

MotIF fine-tunes vision-language models with abstract trajectory representations to improve robotic motion success detection, enabling better understanding of full motion trajectories for various tasks.

Contribution

The paper introduces MotIF, a novel fine-tuning method using abstract motion representations, and presents the MotIF-1K dataset for benchmarking robotic motion understanding.

Findings

1. MotIF achieves at least twice the precision of state-of-the-art VLMs.

2. MotIF achieves 56.1% higher recall in success detection.

3. The model generalizes well to unseen motions, tasks, and environments.
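
The findings above are stated as precision and recall of a binary success detector. As a reminder of how those two metrics are defined, here is a minimal sketch. The function name `precision_recall` and the example labels are hypothetical illustrations, not taken from the paper or its repository.

```python
from typing import Sequence, Tuple


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, recall) for binary success predictions.

    predicted[i] is the detector's verdict for trajectory i,
    actual[i] is the ground-truth label for the same trajectory.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)

    # Precision: of the trajectories flagged as successful, how many were.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the truly successful trajectories, how many were flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels for six trajectories (illustration only).
predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, False]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A detector that rarely predicts success can score high precision with low recall, and one that always predicts success does the reverse, which is why the two are reported separately.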

Abstract

While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able…

Code & Models

Repositories

Minyoung1005/motif (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Optical measurement and interference techniques

Methods: ALIGN