Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Jaemin Kim; Hangeol Chang; Hyunmin Hwang; Choonghan Kim; Jong Chul Ye

arXiv:2505.19075·cs.AI·May 21, 2026

TL;DR

Universal Reasoner (UniR) is a modular, plug-and-play reasoning component that enhances frozen large language models' reasoning abilities through composable, task-specific modules trained with verifiable rewards.

Contribution

UniR introduces a novel, additive, modular reasoning framework that can be combined with frozen LLMs, enabling efficient, composable, and cross-domain reasoning capabilities.

Findings

01. UniR outperforms existing fine-tuning methods on mathematical reasoning tasks.

02. Modules trained on smaller models effectively guide larger LLMs.

03. UniR generalizes across domains like vision-language and medical reasoning.

Abstract

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

Originality
- Proposes a clean, modular training objective that maps trajectory reward to token-level log-probabilities and uses additive logit fusion at inference, which is a conceptually neat design.
- Inference-time composition via logit addition is derived from a KL-regularized multi-objective RL objective (Eq. 6), giving the method theoretical grounding absent in prior adapter ensembles.

Empirical Quality
- Consistent gains across 5 math benchmarks (up to +7.4 avg. over GRPO-full) and 2 trans
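The additive logit fusion mentioned in this review can be illustrated with a minimal sketch. It assumes the frozen backbone and the reasoning module emit logits over the same vocabulary at each decoding step. The function names, the weight `alpha`, and the toy logits are hypothetical and are not taken from the paper or its code.

```python
def fuse_logits(backbone_logits, reasoner_logits, alpha=1.0):
    """Add reasoner logits to the frozen backbone's logits, entry by entry.

    `alpha` is a hypothetical guidance weight introduced for illustration.
    Both lists must index the same (shared or aligned) vocabulary.
    """
    if len(backbone_logits) != len(reasoner_logits):
        raise ValueError("backbone and reasoner must share a vocabulary")
    return [b + alpha * r for b, r in zip(backbone_logits, reasoner_logits)]


def greedy_token(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy next-token logits over a 3-token vocabulary (illustrative values only).
backbone = [2.0, 1.0, 0.5]
reasoner = [-1.0, 1.5, 0.0]
fused = fuse_logits(backbone, reasoner)

backbone_choice = greedy_token(backbone)  # what the backbone alone would pick
fused_choice = greedy_token(fused)        # what the fused logits pick
```

Because softmax turns a sum of logits into a normalized product of the two distributions, adding logits in this way lets the reasoner shift the backbone's choice without touching the backbone's weights.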

Weaknesses

- Tokenizer-alignment prerequisite limits practicality. The paper claims to be “architecture-agnostic” but requires a shared tokenizer (Sec. 1, line 6). Could the method be extended with a tokenizer-mismatch adaptation strategy (e.g., an embedding-level mapping layer trained with UniR), or could the performance drop under vocabulary misalignment be quantified?
- The central identity (Eq. (4)), representing a trajectory reward as the sum of token log-probabilities of some policy, is assumed and only informally
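To make the form of the identity in the second point concrete, here is a minimal sketch that scores a trajectory as the sum of per-token log-probabilities under some policy, as the review describes Eq. (4). It is not the paper's exact formulation: the helper names and the toy logits are hypothetical, and any scaling or normalization the paper applies is omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def trajectory_log_prob(step_logits, token_ids):
    """Sum the log-probability of each chosen token along a trajectory."""
    return sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)
    )


# Toy two-step trajectory over a 2-token vocabulary (illustrative values only).
step_logits = [[0.0, 0.0], [0.0, 0.0]]
token_ids = [0, 1]
score = trajectory_log_prob(step_logits, token_ids)  # equals 2 * log(0.5)
```

Under this reading, a single trajectory-level score decomposes into one term per token, which is what allows a trajectory-level reward signal to act as token-level guidance.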

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This model seems very useful both as a research artifact (it seems to tell us that there isn't all that much to reasoning behavior; not many weights are actually needed to represent it "from scratch") and for users of AI tech. Enabling a "plug-and-play" ecosystem of foundation models and reasoning models could have a high impact on the community.

Weaknesses

Some of the baseline numbers are suspicious. For example, this paper reports a "backbone only" performance on MATH of 28.7 for Llama 3.2 3B, but Hugging Face (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) reports an accuracy of 48.0, which is consistent with what ArtificialAnalysis reports (https://artificialanalysis.ai/evaluations/math-500). I'm also having trouble finding the other baseline numbers in prior work or online (GSM8K seems to mostly be 8-shot in evaluations of llama3 as

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper has a clear motivation: reduce computational costs in post-training and improve the generalization of learned reasoning abilities. The content is high quality and the writing is clear and easy to understand. The experiments are extensive, covering various benchmarks, tasks, model sizes, and series. The proposed UniR shows better performance compared with baseline methods and improves performance on larger backbones that were not used to train UniR. When combined, different UniR mo

Weaknesses

One weakness is the lack of detailed discussion about time and resource costs. The paper should highlight how UniR differs from baseline methods to demonstrate its computational advantages. Another point: the paper needs more analysis of the design choices for the standalone reasoning module. The authors use a smaller LLM as the reasoning module, which is a good initial step. However, have they experimented with using parts of a model instead of a full LLM? Could the reasoning module be furthe

Taxonomy

Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis