Light Future: Multimodal Action Frame Prediction via InstructPix2Pix
Zesen Zhong, Duomin Zhang, and Yijia Li

TL;DR
This paper introduces a lightweight multimodal framework for robotic action forecasting. A fine-tuned InstructPix2Pix model predicts a future frame from a single image and a textual instruction, and the authors report higher SSIM and PSNR than baselines along with faster inference.
Contribution
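The paper adapts InstructPix2Pix, previously used for static image editing, to multimodal future-frame prediction in robotics, which the authors present as the first such adaptation. They report significantly lower computational cost and inference latency than conventional video prediction models.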
Findings
Achieves higher SSIM and PSNR than state-of-the-art baselines.
Requires only a single image and text prompt for prediction.
Enables faster inference with lower GPU demands.
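For readers unfamiliar with the metrics named above, PSNR compares a predicted frame with the ground-truth frame through the mean squared error: PSNR = 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value. The following is a minimal sketch in plain Python, assuming 8-bit images flattened into sequences of intensities; the function name `psnr` and its parameters are illustrative and are not taken from the paper's code.

```python
import math


def psnr(predicted, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are given as flat sequences of pixel intensities
    (an assumption made here for illustration).
    """
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("images must be non-empty and have the same size")

    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the prediction is closer to the ground truth. SSIM, the other reported metric, compares local luminance, contrast, and structure rather than raw pixel error, and is usually computed with an existing library rather than by hand.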
Abstract
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling…
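The prediction setup described in the abstract (a current image plus an instruction, mapped to the view 100 frames later) can be pictured as a set of (input frame, instruction, target frame) triples, which is the shape of data an instruction-guided editing model is fine-tuned on. The sketch below only illustrates that pairing by frame index. The names `EditTriple` and `build_triples`, the `stride` parameter, and the idea of sliding over every start frame are assumptions for illustration; the paper's actual data pipeline is not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriple:
    """One hypothetical training example, stored as frame indices."""
    input_frame: int    # index of the current observation
    instruction: str    # textual task instruction
    target_frame: int   # index of the frame to be predicted


def build_triples(num_frames, instruction, horizon=100, stride=1):
    """Pair each frame t with frame t + horizon from the same episode.

    horizon=100 mirrors the 100-frame (10-second) look-ahead in the
    abstract; stride is an assumed knob for subsampling start frames.
    """
    if horizon <= 0 or stride <= 0:
        raise ValueError("horizon and stride must be positive")

    # Only start frames whose target still lies inside the episode qualify.
    return [
        EditTriple(t, instruction, t + horizon)
        for t in range(0, num_frames - horizon, stride)
    ]
```

Under these assumptions, a 250-frame episode yields 150 triples, and an episode of 100 frames or fewer yields none because no frame has a target 100 frames ahead.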
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
