Light Future: Multimodal Action Frame Prediction via InstructPix2Pix

Zesen Zhong, Duomin Zhang, and Yijia Li

arXiv:2507.14809·cs.CV·November 5, 2025

TL;DR

This paper introduces a lightweight multimodal framework for robotic action forecasting: a fine-tuned InstructPix2Pix model predicts the frame a robot will observe 100 frames (10 seconds) ahead, given a single current image and a textual instruction. The authors report higher SSIM and PSNR than state-of-the-art baselines, along with faster inference and lower GPU demands.

Contribution

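The paper presents itself as the first to adapt InstructPix2Pix, originally a static image editing model, to multimodal future frame prediction in robotics, with significantly lower computational cost and inference latency than conventional video prediction models.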

Findings

01. Achieves higher SSIM and PSNR than state-of-the-art baselines.
02. Requires only a single image and text prompt for prediction.
03. Enables faster inference with lower GPU demands.
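
For readers unfamiliar with the image-quality metrics named in the findings, here is a minimal sketch of how PSNR is commonly computed between a predicted frame and the ground-truth frame. It is not the authors' evaluation code: the function name `psnr`, the flat pixel-sequence input, and the 8-bit `max_value` default are assumptions made for illustration. SSIM is omitted because it needs windowed local statistics.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images.

    Hypothetical helper: both inputs are flat sequences of pixel values
    on the same scale (0..max_value). Higher values mean the prediction
    is closer to the target.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

    if mse == 0:
        return math.inf  # identical images have unbounded PSNR

    # PSNR = 10 * log10(MAX^2 / MSE)
    return 10.0 * math.log10(max_value ** 2 / mse)
```

In practice an array library would typically replace the Python loop; the formula stays the same.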

Abstract

Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling…
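
The abstract frames the task as mapping a current image and a textual instruction to the frame observed 100 frames later. The excerpt does not say how training examples were assembled, so the following is only a hypothetical sketch of how such (source frame, instruction, target frame) triples could be indexed from one recorded episode. The names `FramePair` and `make_frame_pairs`, and the `stride` parameter, are assumptions and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FramePair:
    """One training example: edit frame `source_index` into frame `target_index`."""
    source_index: int
    target_index: int
    instruction: str


def make_frame_pairs(
    num_frames: int,
    instruction: str,
    horizon: int = 100,  # 100 frames ahead, as stated in the abstract
    stride: int = 1,     # assumed: spacing between consecutive source frames
) -> List[FramePair]:
    """Pair every usable frame with the frame `horizon` steps later."""
    if horizon <= 0 or stride <= 0:
        raise ValueError("horizon and stride must be positive")

    # A source frame is usable only if its target still lies inside the episode.
    return [
        FramePair(i, i + horizon, instruction)
        for i in range(0, num_frames - horizon, stride)
    ]
```

For example, `make_frame_pairs(150, "pick up the cup")` would yield 50 pairs, from (0, 100) through (49, 149), while an episode of 100 frames or fewer yields none.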

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis