TL;DR
Higher-resolution visual inputs significantly enhance deep reinforcement learning performance and generalization, especially with architectures that effectively process detailed visual information.
Contribution
This work demonstrates the importance of input resolution and proposes architecture modifications that decouple parameter growth from resolution, enabling better scaling.
Findings
Higher-resolution inputs improve performance and generalization in deep RL, provided the encoder can process the added detail effectively.
Replacing flattening with global average pooling enables resolution scaling without parameter explosion.
Visual scaling yields a 28% performance increase over traditional architectures.
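To illustrate the second finding, here is a minimal pure-Python sketch of the two operations applied to a final convolutional feature map. The map sizes (8x8 and 16x16 with 32 channels) and the helper names `flatten`, `global_average_pool`, and `constant_map` are illustrative assumptions, not taken from the paper.

```python
def flatten(fmap):
    """Flatten an [H][W][C] nested list into one vector of length H*W*C."""
    return [v for row in fmap for pixel in row for v in pixel]


def global_average_pool(fmap):
    """Average each channel over all spatial positions: [H][W][C] -> [C]."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        sum(fmap[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]


def constant_map(h, w, c):
    """Hypothetical feature map whose channel k holds the value k everywhere."""
    return [[[float(k) for k in range(c)] for _ in range(w)] for _ in range(h)]


# Assumed shapes: doubling the input side doubles each spatial side of the map.
small = constant_map(8, 8, 32)
large = constant_map(16, 16, 32)

# Flattened length grows with the spatial area of the feature map.
print(len(flatten(small)), len(flatten(large)))

# Pooled length depends only on the channel count.
print(len(global_average_pool(small)), len(global_average_pool(large)))
```

The pooled vector has the same length at both sizes, so any layer that consumes it keeps a fixed shape when the input resolution changes.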
Abstract
Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths - at their respective best…
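The parameter-growth argument in the abstract can be checked with simple arithmetic. The sketch below counts only the parameters of the first dense layer after the encoder. It assumes an overall spatial downsampling factor of 8, 32 final channels, and a 256-unit dense layer; these are illustrative values, not figures confirmed by the text above, and the function names are hypothetical.

```python
def dense_params(in_features, hidden):
    """Weights plus biases of one fully connected layer."""
    return in_features * hidden + hidden


def head_params_flatten(side, channels=32, downsample=8, hidden=256):
    """Dense layer fed by a flattened (side/downsample)^2 x channels map."""
    s = side // downsample  # assumed spatial side of the final feature map
    return dense_params(s * s * channels, hidden)


def head_params_gap(side, channels=32, downsample=8, hidden=256):
    """Dense layer fed by a global-average-pooled vector of length `channels`.

    `side` and `downsample` are unused on purpose: pooling removes the
    spatial dimensions, so resolution does not affect the count.
    """
    return dense_params(channels, hidden)


for side in (64, 128, 256):
    print(side, head_params_flatten(side), head_params_gap(side))
```

Under these assumptions, doubling the side length quadruples the dense-layer weights in the flattening variant, while the pooled variant stays constant at every resolution.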
