Resa: Transparent Reasoning Models via SAEs
Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, and Willie Neiswanger

TL;DR
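Resa is a family of 1.5B reasoning models trained with sparse autoencoder tuning (SAE-Tuning), a procedure in which a trained SAE guides standard supervised fine-tuning. On certain base models it retains >97% of the RL-trained counterpart's reasoning performance for roughly $1 and around 20 minutes of training, and the extracted reasoning abilities are reported to be generalizable and modular.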
Contribution
The paper presents SAE-Tuning, which first trains a sparse autoencoder (SAE) to capture reasoning abilities from a source model and then uses that SAE to guide supervised fine-tuning of a target model, using only verified question-answer data with no reasoning traces.
Findings
Retains >97% of RL-trained reasoning performance while reducing training cost by >2000x (to roughly $1) and training time by >450x (to around 20 minutes)
Enables reasoning in lightly RL-trained models for around $1
Extracted reasoning abilities are generalizable and modular
Abstract
How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs),…
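The abstract's first step, training an SAE on a source model's representations, can be illustrated with a generic sparse autoencoder. The following is a minimal sketch and not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the layer sizes, and the names `init_sae`, `sae_forward`, `sae_loss`, and `l1_coeff` are all assumptions chosen for illustration, and random vectors stand in for real model activations. How the trained SAE then guides supervised fine-tuning is not specified in the text above, so that step is not shown.

```python
import numpy as np


def init_sae(d_model, d_hidden, seed=0):
    """Create parameters for a one-hidden-layer sparse autoencoder (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, scale, size=(d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }


def sae_forward(params, x):
    """Encode activations x of shape (batch, d_model) into non-negative codes, then decode."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    z = np.maximum(0.0, pre)                      # ReLU keeps codes non-negative
    x_hat = z @ params["W_dec"] + params["b_dec"]  # linear reconstruction
    return z, x_hat


def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes (assumed objective)."""
    z, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + l1_coeff * sparsity


# Stand-in for activations collected from a source model (random data, illustration only).
acts = np.random.default_rng(1).normal(size=(8, 16))
params = init_sae(d_model=16, d_hidden=64)
codes, recon = sae_forward(params, acts)
loss = sae_loss(params, acts)
```

In practice the parameters would be fit by minimizing `sae_loss` with a gradient-based optimizer over activations gathered from the source model; the optimizer is omitted here to keep the sketch short.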
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Balanced Selection · Sparse Autoencoder
