Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI
Aryan Mathur, Asaduddin Ahmed

TL;DR
This paper adapts an interleaved perception-decision transformer encoder to PPO for language-guided reinforcement learning, reporting more stable rewards and stronger vision-language alignment on BabyAI tasks.
Contribution
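It implements the PDiT architecture of Mao et al. (2023), which interleaves perception and decision layers within a single transformer so that feedback from decision-making can refine perceptual features, and adds a CLIP-inspired contrastive loss to align mission text with visual scene features.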
Findings
More stable rewards in BabyAI environment
Stronger alignment between visual and textual features
Enhanced performance over standard PPO baseline
Abstract
Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy's failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and…
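The CLIP-inspired alignment term mentioned in the abstract can be illustrated with a symmetric contrastive (InfoNCE-style) loss over a batch of paired mission and scene embeddings. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for clarity, and embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th text should match the i-th image.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    # Hypothetical helper: symmetric contrastive loss over paired embeddings.
    # temperature=0.07 is an assumed value, not taken from the paper.
    text = [l2_normalize(t) for t in text_emb]
    image = [l2_normalize(v) for v in image_emb]
    # Cosine similarity between every text and every image, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in image]
        for t in text
    ]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two missions and two scenes, correctly paired vs. swapped.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is close to zero when each mission embedding points the same way as its paired scene embedding, and large when the pairs are swapped, which is the behaviour an alignment term of this kind is meant to encourage. How such a term is weighted against the PPO objective is not stated in the text above.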
