Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

Hamza Adnan; Matthew T. Jackson; Alexey Zakharov

arXiv:2602.02259·cs.LG·February 3, 2026

TL;DR

This paper introduces MaskLAM, a simple enhancement for Latent Action Models that uses segmentation masks to filter out distractors, significantly improving their ability to learn meaningful action representations from noisy, unlabelled videos.

Contribution

MaskLAM is a lightweight modification that incorporates pretrained segmentation masks into LAM training, effectively reducing distractor influence without changing the model architecture.

Findings

01. Up to 4x increase in accrued rewards on noisy MuJoCo tasks

02. 3x improvement in latent action quality via linear probe evaluation

03. Effective filtering of background noise in reinforcement learning environments
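
The linear probe evaluation named in the findings can be illustrated with a small sketch. This is not the paper's evaluation code: it assumes the probe is an ordinary least-squares map from latent actions to ground-truth actions, scored by held-out R². The function name `linear_probe_r2`, the 80/20 split, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def linear_probe_r2(latents, actions, train_frac=0.8):
    """Fit a linear map latents -> actions and return held-out R^2.

    latents: (N, D_latent) array of latent actions (hypothetical input).
    actions: (N, D_action) array of ground-truth actions.
    The R^2 metric and the train/test split are assumptions of this sketch.
    """
    n = latents.shape[0]
    split = int(train_frac * n)
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([latents, np.ones((n, 1))])
    # Least-squares fit on the training portion only.
    W, *_ = np.linalg.lstsq(X[:split], actions[:split], rcond=None)
    pred = X[split:] @ W
    held_out = actions[split:]
    ss_res = ((held_out - pred) ** 2).sum()
    ss_tot = ((held_out - held_out.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Synthetic check: latents that linearly encode the actions should probe
# far better than latents that are unrelated noise.
rng = np.random.default_rng(0)
true_actions = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 4))
good_latents = true_actions @ mixing
noise_latents = rng.normal(size=(500, 4))

r2_good = linear_probe_r2(good_latents, true_actions)
r2_noise = linear_probe_r2(noise_latents, true_actions)
```

A higher held-out score means the true actions are more linearly decodable from the latent actions, which is the sense in which a probe measures latent action quality.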

Abstract

Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM -- a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on…
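
The core idea in the abstract, weighting the reconstruction loss by a segmentation mask, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a per-pixel squared error, a mask with values in [0, 1] marking the agent, and a hypothetical `background_weight` parameter controlling how much background pixels still count. The function name `mask_weighted_mse` is invented for this example.

```python
import numpy as np

def mask_weighted_mse(pred, target, mask, background_weight=0.0, eps=1e-8):
    """Mask-weighted mean squared reconstruction error.

    pred, target: (B, H, W, C) predicted and true next frames.
    mask:         (B, H, W) agent segmentation mask with values in [0, 1].
    background_weight: weight given to pixels where mask == 0
                       (hypothetical knob, not taken from the paper).
    """
    # Agent pixels get weight 1; background pixels get background_weight.
    weights = background_weight + (1.0 - background_weight) * mask
    weights = weights[..., None]  # add a channel axis for broadcasting
    sq_err = (pred - target) ** 2
    # Normalise by the total weight so the loss scale does not depend
    # on how much of the frame the agent occupies.
    total_weight = np.broadcast_to(weights, sq_err.shape).sum()
    return (weights * sq_err).sum() / (total_weight + eps)

# Toy check: a reconstruction error that lies only in the background
# contributes nothing once the mask is applied.
rng = np.random.default_rng(0)
target = rng.random((2, 8, 8, 3))
mask = np.zeros((2, 8, 8))
mask[:, 2:6, 2:6] = 1.0          # agent occupies the centre of the frame
pred = target.copy()
pred[:, 0, 0, :] += 1.0          # error in one background pixel only

loss_masked = mask_weighted_mse(pred, target, mask)                # 0.0
loss_plain = mask_weighted_mse(pred, target, np.ones_like(mask))   # plain MSE
```

Because only the loss weighting changes, the encoder and decoder can stay as they are, which is consistent with the abstract's claim that no architectural modifications are required.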

Taxonomy

Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications