TL;DR
This paper introduces DoTS, a post-hoc framework that synthesizes SFT and RLVR capabilities at inference time via task vector arithmetic, avoiding catastrophic forgetting and gradient conflicts.
Contribution
Proposes Decoupled Test-time Synthesis (DoTS), enabling independent training of SFT and RLVR checkpoints and their combination at inference without model updates.
Findings
DoTS matches or exceeds training-based SFT-RLVR methods on reasoning benchmarks.
It surpasses state-of-the-art models when applied to stronger checkpoints.
It generalizes to out-of-domain benchmarks without re-tuning.
Abstract
SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling in distinct dimensions. SFT expands knowledge breadth while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30× magnitude disparity, 45% sign interference, and heterogeneous module-wise update distributions. These findings show that SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework that allows SFT and…
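To make the task-vector vocabulary in the abstract concrete, here is a minimal sketch in plain Python. It assumes a task vector is the parameter difference between a fine-tuned checkpoint and its base, and shows how a magnitude ratio and a sign-conflict rate could be measured on toy numbers. The function names, the toy weights, and the `alpha`/`beta` scaling in `synthesize` are illustrative assumptions; the excerpt above does not specify how DoTS actually combines the two vectors, so this is not the authors' method.

```python
from math import sqrt

def task_vector(base, tuned):
    """Per-module parameter difference (tuned - base)."""
    return {name: [t - b for t, b in zip(tuned[name], base[name])]
            for name in base}

def l2_norm(tv):
    """Overall L2 magnitude of a task vector across all modules."""
    return sqrt(sum(x * x for values in tv.values() for x in values))

def sign_conflict_rate(tv_a, tv_b):
    """Fraction of coordinates, nonzero in both vectors, whose signs disagree."""
    conflicts = total = 0
    for name in tv_a:
        for a, b in zip(tv_a[name], tv_b[name]):
            if a != 0 and b != 0:
                total += 1
                conflicts += (a > 0) != (b > 0)
    return conflicts / total if total else 0.0

def synthesize(base, tv_a, tv_b, alpha=1.0, beta=1.0):
    """Hypothetical merge: base + alpha * tv_a + beta * tv_b (weights are assumptions)."""
    return {name: [w + alpha * a + beta * b
                   for w, a, b in zip(base[name], tv_a[name], tv_b[name])]
            for name in base}

# Toy checkpoints with two "modules"; the values are made up for illustration.
base = {"attn": [0.0, 0.0], "mlp": [0.0, 0.0]}
sft = {"attn": [3.0, -4.0], "mlp": [0.0, 12.0]}
rlvr = {"attn": [-1.0, -2.0], "mlp": [2.0, 0.0]}

tv_sft = task_vector(base, sft)
tv_rlvr = task_vector(base, rlvr)

magnitude_ratio = l2_norm(tv_sft) / l2_norm(tv_rlvr)
conflict_rate = sign_conflict_rate(tv_sft, tv_rlvr)
merged = synthesize(base, tv_sft, tv_rlvr)
```

On real checkpoints the same quantities would be computed per weight tensor. Reporting them per module is what would expose the heterogeneous update distributions the abstract mentions.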
