Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Massimiliano Pappa; Luca Romani; Valentino Sacco; Alessio Palma; Stéphane Lathuilière; Fabio Galasso; Xavier Alameda-Pineda; Indro Spinelli

arXiv:2603.23149 · cs.AI · March 25, 2026

TL;DR

This paper introduces DILLO, a fast, text-based world model that predicts action outcomes without visual simulation. It enables quicker and safer agent steering, with a 14x speedup over visual simulation baselines and an episode success rate improvement of up to 15 percentage points.

Contribution

The paper presents DILLO, a language-based world model that replaces visual simulation with semantic descriptions of next states, accelerating proactive agent steering.

Findings

01. Achieves a 14x speedup over visual simulation baselines.

02. Improves episode success rate by up to 15 percentage points.

03. Produces high-fidelity semantic descriptions of next states.

Abstract

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline…
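As a rough illustration of the "describe-then-act" idea, the sketch below gates each planned action on a text description of its predicted outcome, computed from the policy's latent state and the action alone, with no rendered frames. It is not the paper's implementation: the abstract shown here is truncated and does not say how steering decisions are made. Every name (`describe_then_act`, `Verdict`, `toy_describe`, `toy_is_failure`) and the veto rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types: the paper's actual latent and action formats are not given here.
Latent = Sequence[float]
Action = Sequence[float]


@dataclass
class Verdict:
    description: str  # text description of the predicted next state
    safe: bool        # True if the description does not signal a failure


def describe_then_act(
    latent: Latent,
    candidates: Sequence[Action],
    describe: Callable[[Latent, Action], str],
    is_failure: Callable[[str], bool],
) -> Tuple[Optional[Action], List[Verdict]]:
    """Return the first candidate whose described outcome is not a failure.

    `describe` stands in for a text-based world model; `is_failure` stands in
    for whatever check flags a bad outcome. Both are assumed interfaces.
    """
    verdicts: List[Verdict] = []
    for action in candidates:
        text = describe(latent, action)  # no visual rollout, text only
        ok = not is_failure(text)
        verdicts.append(Verdict(text, ok))
        if ok:
            return action, verdicts
    return None, verdicts  # every candidate was vetoed


def toy_describe(latent: Latent, action: Action) -> str:
    # Fixed rule used only for the demo; a real model would be learned.
    if action[0] > 0.5:
        return "gripper collides with shelf"
    return "gripper reaches cup"


def toy_is_failure(text: str) -> bool:
    # Keyword check used only for the demo.
    return "collides" in text


chosen, verdicts = describe_then_act(
    [0.1], [[0.9], [0.2]], toy_describe, toy_is_failure
)
```

In this toy run the first candidate is vetoed because its description mentions a collision, and the second is returned. The contrast with "simulate-then-act" is that the check operates on a short text string rather than on simulated images.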

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)