Segment to Focus: Guiding Latent Action Models in the Presence of Distractors
Hamza Adnan, Matthew T. Jackson, Alexey Zakharov

TL;DR
This paper introduces MaskLAM, a simple enhancement for Latent Action Models that uses segmentation masks to filter out distractors, significantly improving their ability to learn meaningful action representations from noisy, unlabelled videos.
Contribution
MaskLAM is a lightweight modification that incorporates pretrained segmentation masks into LAM training, effectively reducing distractor influence without changing the model architecture.
Findings
Up to 4x increase in accrued rewards on noisy MuJoCo tasks
3x improvement in latent action quality via linear probe evaluation
Effective filtering of background noise in reinforcement learning environments
Abstract
Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM, a lightweight modification to LAM training that mitigates this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on…
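The abstract describes weighting the reconstruction loss by a segmentation mask but does not give the formula here. The sketch below illustrates one plausible form of that idea, a per-pixel weighted mean squared error. The function name, the `background_weight` parameter, and the normalisation by total weight are all assumptions made for illustration, not the authors' implementation.

```python
def masked_reconstruction_loss(pred, target, mask, background_weight=0.0):
    """Mask-weighted mean squared error (illustrative sketch only).

    pred, target: 2D lists of floats (H x W), predicted and true next frame.
    mask: 2D list with values in [0, 1]; 1 marks agent pixels.
    background_weight: hypothetical weight given to non-agent pixels.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            # Agent pixels get weight 1; background pixels get background_weight.
            w = m + (1.0 - m) * background_weight
            weighted_error += w * (p - t) ** 2
            total_weight += w
    # Assumed normalisation: divide by the total weight, guarding against
    # an all-zero mask.
    return weighted_error / total_weight if total_weight > 0 else 0.0


# Toy 2x2 frame: the only prediction error sits on a background pixel.
target = [[0.0, 0.0], [1.0, 0.0]]
pred = [[0.5, 0.0], [1.0, 0.0]]
mask = [[0.0, 0.0], [1.0, 0.0]]

masked = masked_reconstruction_loss(pred, target, mask)
unmasked = masked_reconstruction_loss(pred, target, mask, background_weight=1.0)
```

In this toy case the masked loss ignores the background error entirely, while setting `background_weight=1.0` recovers an ordinary mean squared error over all pixels. In practice the mask would come from a pretrained segmentation model, which is not shown here.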
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
