RPM: Reasoning-Level Personalization for Black-Box Large Language Models

Jieyong Kim; Tongyoung Kim; Soojin Yoon; Jaehyung Kim; Dongha Lee

arXiv:2505.21082·cs.CL·March 3, 2026

TL;DR

This paper introduces RPM, a novel framework for black-box large language models that personalizes responses by modeling user-specific reasoning structures from behavioral data, improving both accuracy and interpretability.

Contribution

RPM is the first systematic approach to discover user reasoning paths and guide inference, moving beyond response-level personalization to reasoning-level personalization.

Findings

01

RPM outperforms existing methods across four tasks.

02

It enhances personalization performance.

03

It improves interpretability of model responses.

Abstract

While black-box large language models are widely deployed, they produce generic outputs that overlook individual user preferences. Current personalization methods are fundamentally limited to response-level personalization; they only match final outputs, failing to model the underlying reasoning that connects user behavior to responses. To address this, this work introduces reasoning-level personalization as a new paradigm and proposes RPM, the first systematic framework that automatically discovers user-specific reasoning structures from raw behavioral data to guide the model's personalized inference. RPM constructs a structured model of user behavior, built from response-influential features and statistical factors, to create personalized reasoning paths and retrieve beneficial examples for guiding inference through a feature-based retrieval mechanism. Extensive experiments across four…
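The abstract names three ingredients: response-influential features per example, statistical factors aggregated over a user's history, and feature-based retrieval of helpful examples. The sketch below illustrates the last two under stated assumptions and is not the authors' implementation. The `Example` schema, the `feature_to_factor` mapping (a hypothetical stand-in for grouping features into factors), plain frequency as the "statistic", and Jaccard overlap as the retrieval score are all illustrative choices; features are assumed to be already extracted, for example by an LLM prompt.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Example:
    """One past user interaction (hypothetical schema)."""
    text: str
    # Response-influential features, assumed to be extracted beforehand
    # (e.g. by an LLM prompt); plain strings here for illustration.
    features: frozenset = field(default_factory=frozenset)


def factor_statistics(history, feature_to_factor):
    """Fraction of past examples in which each factor appears.

    `feature_to_factor` is a hypothetical mapping standing in for the
    grouping of features into factors; the "statistic" is simple frequency.
    """
    counts = Counter()
    for ex in history:
        # A factor is counted at most once per example.
        counts.update({feature_to_factor[f] for f in ex.features if f in feature_to_factor})
    total = len(history) or 1
    return {factor: n / total for factor, n in counts.items()}


def jaccard(a, b):
    """Jaccard overlap of two feature sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_by_features(query_features, history, k=2):
    """Return the k past examples whose features overlap most with the query.

    Python's sort is stable (also with reverse=True), so ties keep their
    original history order.
    """
    ranked = sorted(history, key=lambda ex: jaccard(query_features, ex.features), reverse=True)
    return ranked[:k]


# Toy usage with made-up data.
history = [
    Example("review A", frozenset({"price", "battery"})),
    Example("review B", frozenset({"design"})),
    Example("review C", frozenset({"battery", "screen"})),
]
mapping = {"price": "value", "battery": "hardware", "screen": "hardware", "design": "aesthetics"}
print(factor_statistics(history, mapping))
print([ex.text for ex in retrieve_by_features(frozenset({"battery"}), history)])
```

Jaccard overlap is used here only because it is simple and needs no embedding model; a real system might instead score with embeddings or weight features by their factor statistics.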

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

(1) This study introduces structured reasoning in LLM generation (which may be quite common nowadays) to the personalization task. This attempt is original. (2) Extensive experiments show that the proposed structure benefits the personalization task more than simpler prompting and RAG-based baselines and two personalization-focused baselines. (3) The method description is clear.

Weaknesses

(1) The novelty seems to lie in applying a common method to a specific task (i.e., personalization), which may be limited. (2) More baselines are needed to support the claim that the proposed structured prompting framework is better. Many studies give LLMs an explicit structure for reasoning, and they are not compared in the current manuscript. For example, least-to-most prompting decomposes a hard problem into ordered subproblems and solves them sequentially. This…
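Least-to-most prompting, which Reviewer 01 names as a missing baseline, can be sketched roughly as follows. This illustrates that technique only, not RPM. The two LLM calls are passed in as plain functions (one decomposes the problem into ordered subproblems, one answers a prompt); `toy_decompose` and `toy_solve` are hypothetical stubs so the example runs without a model.

```python
from typing import Callable, List


def least_to_most(
    problem: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str], str],
) -> str:
    """Stage 1: split the problem into ordered subproblems.
    Stage 2: solve them one by one, appending each answered subproblem to
    the prompt so later steps can build on earlier answers."""
    subproblems = decompose(problem)
    transcript = f"Problem: {problem}\n"
    answer = ""
    for sub in subproblems:
        answer = solve(transcript + f"Q: {sub}\nA:")
        transcript += f"Q: {sub}\nA: {answer}\n"
    # The answer to the last subproblem is taken as the final answer.
    return answer


# Hypothetical stand-ins for LLM calls: a fixed decomposition, and a solver
# that reports how many earlier answers were visible in its prompt.
def toy_decompose(problem: str) -> List[str]:
    return ["What is the user's usual tone?", "Write the reply in that tone."]


def toy_solve(prompt: str) -> str:
    return f"step {prompt.count('A: ') + 1}"


print(least_to_most("Draft a personalized reply.", toy_decompose, toy_solve))
```

The key design point is that each subproblem's prompt carries all previously answered subproblems, which is what distinguishes least-to-most prompting from solving the subproblems independently.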

Reviewer 02 · Rating 4 · Confidence 3

Strengths

S1. The paper is well written and addresses an important gap in LLM personalization, namely the reasoning step. S2. The evaluation seems comprehensive, and the results show the effectiveness of the technique. S3. The framework is prompting-compatible and cost-efficient.

Weaknesses

W1. The set of published baselines is rather limited; it would be good to see more baselines to understand where the technique stands. W2. The paper does not evaluate generalizability across choices of LLM, from small to large models and across different architectures. W3. The human evaluation in Sec 4.4 compares RPM against baselines with CoT prompting, but there is no analysis of whether raters might confuse longer, more detailed explanations with genuinely better reasoning. W4. No length-nor…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Well-specified method with prompts provided for each step.
- Includes latency and cost analyses (online and offline).
- Explicit feature–factor–reasoning structure facilitates diagnosis and analysis.
- The experimental evaluation is comprehensive.

Weaknesses

- Extraction/clustering robustness: Reliance on prompt-based LLMs for feature extraction, clustering, and influence/polarity judgments may cause spurious features or misclustered factors, especially in term-heavy domains or under style drift, propagating errors to retrieval and generation.
- “Personalized reasoning paths” are a pragmatic approximation of user cognition; they may appear plausible without being literally true.
- The method is fundamentally prompt-centric; consequently, the technique…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus