Survival is the Only Reward: Sustainable Self-Training Through Environment-Mediated Selection

Jennifer Dodgson; Alfath Daryl Alhajir; Michael Joedhitya; Akira Rafhael Janson Pattirane; Surender Suresh Kumar; Joseph Lim; C.H. Peh; Adith Ramdas; Steven Zhang Zhexu

arXiv:2601.12310 · cs.AI · January 21, 2026

Open Access

TL;DR

This paper proposes a novel self-training architecture that relies solely on environmental viability for selection, avoiding reward hacking and enabling sustainable, open-ended self-improvement in autonomous systems.

Contribution

It introduces environment-mediated selection as a new paradigm for stable self-training, demonstrating its effectiveness and unique dynamics compared to reward-based methods.

Findings

01. Effective strategies persist through consolidation and pruning.

02. Models develop meta-learning behaviors without explicit instructions.

03. Environment-grounded selection enables sustainable self-improvement.

Abstract

Self-training systems often degenerate due to the lack of an external criterion for judging data quality, leading to reward hacking and semantic drift. This paper provides a proof-of-concept system architecture for stable self-training under sparse external feedback and bounded memory, and empirically characterises its learning dynamics and failure modes. We introduce a self-training architecture in which learning is mediated exclusively by environmental viability, rather than by reward, objective functions, or externally defined fitness criteria. Candidate behaviours are executed under real resource constraints, and only those whose environmental effects both persist and preserve the possibility of future interaction are propagated. The environment does not provide semantic feedback, dense rewards, or task-specific supervision; selection operates solely through differential survival…
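
The selection rule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyEnv`, its energy budget, the decaying artifacts, the `viable` and `select` helpers, and the example behaviours are all hypothetical names and assumptions introduced here only to show the idea. A candidate is kept when (a) the environment can still be interacted with after the candidate runs and (b) the candidate's effect is still present a few steps later. No reward or score is computed at any point, and the survivor list is capped to mimic bounded memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ToyEnv:
    """Hypothetical environment: a finite energy budget plus decaying artifacts."""
    energy: int = 10
    artifacts: Dict[str, int] = field(default_factory=dict)

    def alive(self) -> bool:
        # Future interaction is possible only while some energy remains.
        return self.energy > 0

    def tick(self) -> None:
        # Each artifact loses one unit of durability per step; spent ones vanish.
        self.artifacts = {k: v - 1 for k, v in self.artifacts.items() if v - 1 > 0}

Behaviour = Callable[[ToyEnv], None]

def viable(name: str, behaviour: Behaviour, ticks: int = 2) -> bool:
    """Survival check only: no reward, score, or task-specific feedback."""
    env = ToyEnv()
    behaviour(env)
    if not env.alive():              # the behaviour exhausted the environment
        return False
    for _ in range(ticks):
        env.tick()
    return name in env.artifacts     # the behaviour's effect persisted

def select(candidates: Dict[str, Behaviour], memory_limit: int) -> List[str]:
    """Propagate viable candidates, keeping at most memory_limit of them."""
    survivors = [n for n, b in candidates.items() if viable(n, b)]
    return survivors[-memory_limit:]  # bounded memory: keep the newest survivors

def frugal(env: ToyEnv) -> None:
    env.energy -= 2
    env.artifacts["frugal"] = 5       # cheap and durable

def greedy(env: ToyEnv) -> None:
    env.energy -= 10
    env.artifacts["greedy"] = 9       # durable, but drains the environment

def fleeting(env: ToyEnv) -> None:
    env.energy -= 1
    env.artifacts["fleeting"] = 1     # cheap, but gone after one tick

CANDIDATES: Dict[str, Behaviour] = {
    "frugal": frugal,
    "greedy": greedy,
    "fleeting": fleeting,
}

if __name__ == "__main__":
    print(select(CANDIDATES, memory_limit=2))
```

In this sketch `greedy` fails the check because it leaves no energy for future interaction, and `fleeting` fails because its artifact does not persist. Both are filtered out without any numeric fitness being assigned, which is the distinction from reward-based selection that the abstract draws.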

Taxonomy

Topics: Reinforcement Learning in Robotics · Data Stream Mining Techniques · Personal Information Management and User Behavior