Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

Ce Zhang; Yale Song; Ruta Desai; Michael Louis Iuzzolino; Joseph Tighe; Gedas Bertasius; Satwik Kottur

arXiv:2507.15130 · cs.CV · July 22, 2025

TL;DR

This paper introduces VideoPlan, an approach that improves video-based visual planning by training with auxiliary tasks and multi-token prediction. It reports state-of-the-art results on COIN and CrossTask and results on par with the state of the art on Ego4D Long-term Action Anticipation.

Contribution

It proposes Auxiliary Task Augmentation and Multi-token Prediction to improve large multimodal models for long-horizon visual planning tasks.

Findings

1. Improves over prior methods by 7.3% on COIN and 3.4% on CrossTask.

2. Performs on par with the state of the art on Ego4D Long-term Action Anticipation.

3. Attributes its gains over prior methods to modeling the structured action space more effectively.

Abstract

Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user's progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) scarcity of procedural annotations, limiting the model's ability to learn procedural task dynamics effectively, and (2) inefficiency of next-token prediction objective to explicitly capture the structured action space for visual planning when compared to free-form, natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal…
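To make the contrast with next-token prediction concrete, the following is a minimal sketch of a generic multi-token prediction loss. It is not taken from the paper: the function names, the one-head-per-future-offset layout, and the plain averaging of per-head losses are all assumptions made for illustration, and VideoPlan's actual formulation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_token_loss(head_logits, targets):
    """Average cross-entropy over several future-token prediction heads.

    Hypothetical layout (an assumption, not the paper's design):
    head_logits[h][t] holds the vocabulary logits emitted at position t
    by head h, and head h is trained to predict targets[t + 1 + h].
    With a single head this reduces to ordinary next-token prediction.
    """
    total, count = 0.0, 0
    for h, per_position in enumerate(head_logits):
        offset = h + 1  # head 0 looks one token ahead, head 1 two, ...
        for t, logits in enumerate(per_position):
            target_index = t + offset
            if target_index >= len(targets):
                break  # no ground-truth token that far ahead
            total -= log_softmax(logits)[targets[target_index]]
            count += 1
    return total / count if count else 0.0


# Toy usage: two heads, a 4-token vocabulary, a 3-token target sequence.
targets = [0, 1, 2]
uniform_heads = [[[0.0] * 4 for _ in targets] for _ in range(2)]
loss = multi_token_loss(uniform_heads, targets)
```

With uniform logits every prediction is a uniform guess over the vocabulary, so the loss equals the log of the vocabulary size. In this sketch each extra head adds supervision for a token further in the future at the same position, which is the general sense in which multi-token objectives expose more of a multi-step action sequence per training example.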
