Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Aaditya Vikram Prasad; Connor Watts; Jack Merullo; Dhruvil Gala; Owen Lewis; Thomas McGrath; Ekdeep Singh Lubana

arXiv:2602.10067·cs.LG·February 19, 2026

TL;DR

This paper introduces RLFR, a reinforcement learning framework that uses interpretability-derived features as scalable rewards to reduce hallucinations in language models while maintaining their performance.

Contribution

The paper presents RLFR (Reinforcement Learning from Feature Rewards), an RL pipeline that uses interpretability-derived features as reward functions for open-ended tasks, enabling scalable supervision and improved factual accuracy.

Findings

01  58% reduction in hallucinations compared to the original model

02  Maintains performance on standard benchmarks

03  Operates efficiently with scalable test-time compute

Abstract

Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on…
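
To make the idea of "features as reward functions" concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical linear probe (the `weights` and `bias` arguments) that maps a claim's activation vector to a probability that the claim is hallucinated, assumes the reward is one minus the mean of those probabilities, and assumes test-time compute takes the form of best-of-n selection under the same reward. All function names, the probe form, and the aggregation rule are illustrative assumptions; the abstract does not specify them.

```python
import math
from typing import List, Sequence


def probe_score(activation: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear probe: sigmoid(w . h + b), read as P(claim is hallucinated)."""
    logit = sum(w * h for w, h in zip(weights, activation)) + bias
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def feature_reward(claim_activations: List[Sequence[float]],
                   weights: Sequence[float], bias: float) -> float:
    """Assumed reward in [0, 1]: one minus the mean hallucination probability over claims."""
    if not claim_activations:
        return 1.0  # assumption: a completion with no claims has nothing to penalize
    scores = [probe_score(a, weights, bias) for a in claim_activations]
    return 1.0 - sum(scores) / len(scores)


def best_of_n(candidates: List[List[Sequence[float]]],
              weights: Sequence[float], bias: float) -> int:
    """Assumed test-time use: index of the candidate completion with the highest reward."""
    rewards = [feature_reward(c, weights, bias) for c in candidates]
    return max(range(len(rewards)), key=rewards.__getitem__)


if __name__ == "__main__":
    w, b = [1.0, -1.0], 0.0            # toy probe parameters, made up for illustration
    risky = [[2.0, 0.0]]               # one claim whose activation the probe flags
    safe = [[0.0, 2.0]]                # one claim the probe considers factual
    print(feature_reward(risky, w, b), feature_reward(safe, w, b))
    print(best_of_n([risky, safe], w, b))
```

In an RL loop, a scalar like `feature_reward` would stand in for a human or model judge when scoring sampled completions; how the paper actually extracts claims, trains its probes, and combines scores is not recoverable from the truncated abstract.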

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications