TL;DR
Universal Reasoner (UniR) is a modular, plug-and-play reasoning component that enhances frozen large language models' reasoning abilities through composable, task-specific modules trained with verifiable rewards.
Contribution
UniR introduces a novel, additive, modular reasoning framework that can be combined with frozen LLMs, enabling efficient, composable, and cross-domain reasoning capabilities.
Findings
UniR outperforms existing fine-tuning methods on mathematical reasoning tasks.
Modules trained on smaller models effectively guide larger LLMs.
UniR generalizes across domains like vision-language and medical reasoning.
Abstract
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs…
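As a rough illustration of the inference-time composition described in the abstract and reviews (additive logit fusion between a frozen backbone and a reasoning module over a shared vocabulary), the following minimal sketch fuses two next-token score vectors. It is not the authors' implementation: the function names, the toy four-token vocabulary, and the weighting parameter `alpha` are assumptions introduced here for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def fuse_logits(backbone_logits, reasoner_logits, alpha=1.0):
    """Add reasoning-module scores to frozen-backbone scores, then renormalize.

    `alpha` is a hypothetical guidance weight (an assumption, not taken from
    the paper). Both inputs must be indexed by the same vocabulary.
    """
    if len(backbone_logits) != len(reasoner_logits):
        raise ValueError("backbone and reasoning module must share a vocabulary")
    base = log_softmax(backbone_logits)
    guide = log_softmax(reasoner_logits)
    return log_softmax([b + alpha * g for b, g in zip(base, guide)])


def greedy_token(log_probs):
    """Index of the highest-scoring token."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)


# Toy next-token scores over a 4-token vocabulary (made-up numbers).
backbone = [2.0, 1.5, 0.1, -1.0]   # the frozen backbone alone prefers token 0
reasoner = [0.0, 2.0, 0.0, 0.0]    # the reasoning module prefers token 1

fused = fuse_logits(backbone, reasoner)
print(greedy_token(log_softmax(backbone)), greedy_token(fused))
```

Because softmax is invariant to adding a constant, summing raw logits or summing log-probabilities yields the same fused distribution. The length check reflects the shared-or-aligned token space requirement: the sum is only meaningful when both models index the same vocabulary.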
Peer Reviews
Decision: Submitted to ICLR 2026
Originality
- Proposes a clean, modular training objective mapping trajectory reward to token-level log-probabilities and uses additive logit fusion at inference, which is a conceptually neat design.
- Inference-time composition via logit addition is derived from a KL-regularized multi-objective RL objective (Eq. 6), giving the method theoretical grounding absent in prior adapter ensembles.
Empirical Quality
- Consistent gains across 5 math benchmarks (up to +7.4 avg. over GRPO-full) and 2 trans
- The tokenizer-alignment prerequisite limits practicality. The paper claims to be “architecture-agnostic” but requires a shared tokenizer (sec. 1, line 6). Could the method be extended with a tokenizer-mismatch adaptation strategy (e.g., an embedding-level mapping layer trained with UniR), or could the performance drop under vocabulary misalignment be quantified?
- The central identity (Eq. (4)), representing a trajectory reward as the sum of token log-probabilities of some policy, is assumed and only informally
This model seems very useful both as a research artifact (it seems to tell us that there isn't all that much to reasoning behavior; not many weights are actually needed to represent it "from scratch") as well as to users of AI tech. Enabling a "plug-and-play" ecosystem of foundation models and reasoning models could have a high impact on the community.
Some of the baseline numbers are suspicious. For example, this paper reports a "backbone only" performance on MATH of 28.7 for Llama 3.2 3B, but the Hugging Face model card (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) reports an accuracy of 48.0, which is consistent with what ArtificialAnalysis reports (https://artificialanalysis.ai/evaluations/math-500). I'm also having trouble finding the other baseline numbers in prior work or online (GSM8K seems to mostly be 8-shot in evaluations of Llama 3 as
This paper has a clear motivation: reduce computational costs in post-training and improve the generalization of learned reasoning abilities. The content is high quality and the writing is clear and easy to understand. The experiments are extensive, covering various benchmarks, tasks, model sizes, and series. The proposed UniR shows better performance compared with baseline methods and improves performance on larger backbones that were not used to train UniR. When combined, different UniR mo
One weakness is the lack of detailed discussion about time and resource costs. The paper should highlight how UniR differs from baseline methods to demonstrate its computational advantages. Another point: the paper needs more analysis of the design choices for the standalone reasoning module. The authors use a smaller LLM as the reasoning module, which is a good initial step. However, have they experimented with using parts of a model instead of a full LLM? Could the reasoning module be furthe
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis
