Weakly-supervised Latent Models for Task-specific Visual-Language Control
Xian Yeow Lee, Lasitha Vidyaratne, Gregory Sin, Ahmed Farahat, Chetan Gupta

TL;DR
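This paper introduces a task-specific latent dynamics model that improves spatial control for autonomous inspection. The model learns action-induced shifts in a shared latent space from goal-state supervision, achieving higher success rates than direct LLM control while avoiding the data and compute cost of conventional world models.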
Contribution
It presents a novel, domain-specific latent dynamics approach that uses goal-state supervision and global action embeddings for efficient visual-language control.
Findings
Achieves a 71% success rate on spatial grounding tasks, compared with 58% when a large language model is used directly for visual control.
Generalizes well to unseen images and instructions.
Uses goal-state supervision to learn effective latent dynamics.
Abstract
Autonomous inspection in hazardous environments requires AI agents that can interpret high-level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58% success in this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data- and compute-intensive. To address this, we propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision. The model leverages global action embeddings and…
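The idea of rolling out candidate actions in a latent space and scoring them against a goal state can be illustrated with a small toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the action set, the sigmoid gating, the random parameters, and the names `shift`, `rollout`, `goal_loss`, and `pick_action` are all hypothetical, and the abstract does not specify the actual architecture or objective.

```python
import math
import random

LATENT_DIM = 8
ACTIONS = ["left", "right", "up", "down"]  # hypothetical action set

rng = random.Random(0)
# Assumption: one global embedding per action, shared across all states.
ACTION_EMB = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in ACTIONS]
# Assumption: a small matrix that makes each shift depend on the current state.
W = [[rng.gauss(0, 0.1) for _ in range(LATENT_DIM)] for _ in range(LATENT_DIM)]


def shift(z, a):
    """Latent shift for action index a in state z: a state-dependent
    sigmoid gate applied elementwise to the action's global embedding."""
    gates = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, z))))
             for row in W]
    return [g * e for g, e in zip(gates, ACTION_EMB[a])]


def rollout(z, actions):
    """Apply a sequence of action indices to a latent state."""
    z = list(z)
    for a in actions:
        z = [x + d for x, d in zip(z, shift(z, a))]
    return z


def goal_loss(z_start, actions, z_goal):
    """Goal-state supervision in this sketch: only the final latent is
    compared with the goal latent; intermediate states are not supervised."""
    z_end = rollout(z_start, actions)
    return sum((p - g) ** 2 for p, g in zip(z_end, z_goal))


def pick_action(z, z_goal):
    """Roll out each candidate action one step and return the index of
    the action whose predicted latent is closest to the goal."""
    losses = [goal_loss(z, [a], z_goal) for a in range(len(ACTIONS))]
    return losses.index(min(losses))
```

In a real system the latent states would come from an encoder over the camera image and the language instruction, and the parameters would be trained by minimizing a loss such as `goal_loss` rather than sampled at random as they are here.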
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
