Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems
William Parris

TL;DR
This paper argues that scalarized preference optimization compresses semantically distinct kinds of evaluative dissatisfaction into a single generalized signal, names this problem Semantic Reward Collapse (SRC), and proposes Constitutional Reward Stratification (CRS) as a potential framework to preserve epistemic integrity.
Contribution
The paper names and describes Semantic Reward Collapse in adaptive AI systems and proposes Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve epistemic distinctions.
Findings
Semantic Reward Collapse entangles distinct evaluative categories.
Adaptive systems may suppress epistemic failures under generalized reward signals.
CRS offers a testable approach to maintain epistemic differentiation.
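As a rough illustration of the first and third findings, the Python sketch below shows how two different complaints can become indistinguishable once they are summed into one scalar, while a per-category vector keeps them apart. The category names are taken from the abstract. The function names, the unit penalties, and the vector representation are assumptions made for this example and should not be read as the paper's CRS formulation.

```python
from typing import Mapping, Tuple

# Category names follow the abstract; everything else here is hypothetical.
CATEGORIES = (
    "factual_incorrectness",
    "uncertainty_disclosure",
    "formatting",
    "latency",
    "social_preference",
)


def scalar_reward(penalties: Mapping[str, float]) -> float:
    """Collapse all per-category penalties into one number (the scalarized case)."""
    return -sum(penalties.get(c, 0.0) for c in CATEGORIES)


def per_category_reward(penalties: Mapping[str, float]) -> Tuple[float, ...]:
    """Keep one reward component per category (an assumed, un-collapsed alternative)."""
    return tuple(-penalties.get(c, 0.0) for c in CATEGORIES)


# Two semantically different complaints, each with an invented unit penalty.
factual_error = {"factual_incorrectness": 1.0}
bad_formatting = {"formatting": 1.0}

# Under scalarization the two complaints produce the same signal.
same_scalar = scalar_reward(factual_error) == scalar_reward(bad_formatting)

# With one component per category they remain distinguishable.
same_vector = per_category_reward(factual_error) == per_category_reward(bad_formatting)
```

In this toy setup an optimizer that sees only the scalar cannot tell whether it was penalized for being wrong or for formatting, which is the kind of entanglement the findings describe.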
Abstract
Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating…
