Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI

Aryan Mathur; Asaduddin Ahmed

arXiv:2510.23148 · cs.LG · October 28, 2025

TL;DR

This paper implements an interleaved transformer encoder (PDiT) trained with PPO for vision-language reinforcement learning, and reports more stable rewards and stronger visual-textual alignment on BabyAI tasks than a standard PPO baseline.

Contribution

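It adapts the PDiT architecture of Mao et al. (2023), which interleaves perception and decision layers within a single transformer so that feedback from decision-making can refine perceptual features, to vision-language RL tasks, and adds a CLIP-inspired contrastive loss that aligns mission text with visual features.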
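To make the idea of interleaving concrete, here is a minimal sketch of the layer ordering only. It is not the PDiT implementation: `perception_layer`, `decision_layer`, `build_stack`, and `forward` are hypothetical names, and the two layers are toy arithmetic stand-ins for transformer blocks. The point is simply that an interleaved stack alternates the two block types, whereas a conventional stack runs all perception blocks before any decision block.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def perception_layer(features: Vector) -> Vector:
    """Toy stand-in for a perception block: mixes each feature with the mean."""
    mean = sum(features) / len(features)
    return [0.5 * x + 0.5 * mean for x in features]


def decision_layer(features: Vector) -> Vector:
    """Toy stand-in for a decision block: a simple ReLU nonlinearity."""
    return [max(0.0, x) for x in features]


def build_stack(num_blocks: int, interleaved: bool) -> List[Layer]:
    """Return the layer order for an interleaved or a conventional stack."""
    if interleaved:
        # Perception, decision, perception, decision, ...
        return [layer for _ in range(num_blocks)
                for layer in (perception_layer, decision_layer)]
    # Conventional: all perception layers first, then all decision layers.
    return [perception_layer] * num_blocks + [decision_layer] * num_blocks


def forward(stack: List[Layer], features: Vector) -> Vector:
    """Apply the layers in order."""
    for layer in stack:
        features = layer(features)
    return features


if __name__ == "__main__":
    for interleaved in (True, False):
        stack = build_stack(2, interleaved)
        print(interleaved, [layer.__name__ for layer in stack])
```

In a trained model the same ordering means every perception block sits directly below a decision block, so gradients from the policy loss reach perceptual features at each depth rather than only through the top of a separate encoder.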

Findings

01 More stable rewards in the BabyAI environment

02 Stronger alignment between visual and textual features

03 Enhanced performance over a standard PPO baseline

Abstract

Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy's failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and…
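The CLIP-inspired alignment term mentioned in the abstract can be illustrated with a symmetric contrastive (InfoNCE-style) loss over a batch of paired mission and scene embeddings. The sketch below is an assumption about the general form of such a loss, not the authors' code: the function name `clip_style_contrastive_loss`, the temperature value of 0.07, and the use of plain Python lists are all illustrative choices.

```python
import math
from typing import List

Matrix = List[List[float]]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_contrastive_loss(text_emb: Matrix, image_emb: Matrix,
                                temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over cosine similarities.

    Row i of text_emb is assumed to be paired with row i of image_emb;
    every other row in the batch acts as a negative.
    """
    text = [l2_normalize(v) for v in text_emb]
    image = [l2_normalize(v) for v in image_emb]
    n = len(text)

    # logits[i][j] = cosine similarity of text i and image j, scaled.
    logits = [[sum(a * b for a, b in zip(text[i], image[j])) / temperature
               for j in range(n)] for i in range(n)]

    # Text-to-image direction: the correct image for text i is image i.
    text_to_image = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Image-to-text direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (text_to_image + image_to_text)


if __name__ == "__main__":
    missions = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_style_contrastive_loss(missions, [[1.0, 0.0], [0.0, 1.0]]))
    print(clip_style_contrastive_loss(missions, [[0.0, 1.0], [1.0, 0.0]]))
```

When each mission embedding points the same way as its paired scene embedding, the loss is close to zero; swapping the pairs makes it large. In an RL setting a term like this would typically be added to the PPO objective with a weighting coefficient, though the abstract does not say how the two are combined.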
