TL;DR
This paper introduces AlignXplore, a model that enhances inductive reasoning for personalized preference inference from behavioral signals, improving accuracy and generalization in user preference modeling.
Contribution
The paper presents AlignXplore, a novel approach combining extended reasoning chains and reinforcement learning to improve personalized preference inference in LLMs.
Findings
Achieves a 15.49% improvement over baseline models on benchmarks.
Supports efficient streaming inference and iterative refinement.
Demonstrates emergence of human-like inductive reasoning patterns during training.
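The streaming finding above can be illustrated with a minimal sketch: a text preference summary is refined one batch of behavioral signals at a time, so each step sees only the previous summary and the new batch rather than the full history. This is not the paper's implementation. The names `stream_preferences`, `InferFn`, and `toy_infer` are hypothetical, and `toy_infer` is a keyword-merging stand-in for an actual preference inference model.

```python
from typing import Callable, Iterable, List

# Hypothetical interface (an assumption, not from the paper):
# (previous_summary, new_signals) -> updated_summary
InferFn = Callable[[str, List[str]], str]


def stream_preferences(
    batches: Iterable[List[str]], infer: InferFn, initial: str = ""
) -> List[str]:
    """Refine a text preference summary one batch at a time.

    Each step receives only the previous summary and the new batch,
    never the full interaction history.
    """
    summary = initial
    summaries = []
    for batch in batches:
        summary = infer(summary, batch)
        summaries.append(summary)
    return summaries


def toy_infer(previous: str, signals: List[str]) -> str:
    """Stand-in for a preference inference model (illustrative only).

    Treats each signal as a preference phrase and appends unseen ones.
    """
    known = [p for p in previous.split("; ") if p]
    for signal in signals:
        if signal not in known:
            known.append(signal)
    return "; ".join(known)


if __name__ == "__main__":
    batches = [["concise answers"], ["concise answers", "code examples"]]
    for step, summary in enumerate(stream_preferences(batches, toy_infer), 1):
        print(step, summary)
```

In a real setting `infer` would wrap a call to a language model. Because the summary is carried forward as state, the work at each step depends on the size of the new batch rather than on the length of the accumulated history.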
Abstract
Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks, where deductive reasoning predominates, inductive reasoning, the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities because user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users'…
Peer Reviews
Decision·Submitted to ICLR 2026
It is a good paper. 1. The paper reads clearly, and the key ideas, notation, and losses are easy to follow. 2. The idea to use an induction model to generate a readable, explicit, portable personalization profile that can be updated over time is innovative and useful in real applications. 3. The two-stage training, combining supervised imitation and RL, feels direct and complete. 4. The experiments cover a wide range, including single-shot and streaming, cross-model checks, and show strong resul
There are still some light weaknesses to consider. - **LLM-as-Judge reliance.** Results are optimized and evaluated mainly via $R_{\text{jud}}$ (LLM-as-a-judge), with limited human evaluation, so there’s a risk of overfitting to the judge model. - **Out-of-domain coverage.** P-SOUPS is a solid OOD set, but broader tests (e.g., HelpSteer2, UltraFeedback, SHP, and a persona-style corpus) would strengthen the results. - **Readability/notation.** In §3.1, the overloaded $R$ (downstream model vs. re
The paper tackles a very relevant problem of automatic preference personalization for LLMs. The overall approach is reasonable and interpretable by keeping an explicit, text-based description of the preferences.
My main concern is the clarity of presentation: the paper as it stands is extremely difficult to read and understand because it discusses too many things at once. Some examples: 1. There's a big emphasis on "inductive reasoning", which is technically correct, but seems largely inconsequential to the method and the problem being tackled. 2. Figure 1 is very dense and hard to navigate before having understood the rest of the paper. 3. In section 3, there's a lot of redundant notation, like "u
1. Streaming preference inference is an interesting direction and the proposed method directly targets this goal. The approach allows the model to incrementally refine inferred preferences as new signals come in, without reprocessing the entire user history, improving efficiency significantly. 2. The experiments are systematic and comprehensive, covering both in-domain and out-of-domain datasets while evaluating generalization, robustness, and efficiency. The authors run various ablation studies
1. Although the paper claims to explore extended reasoning for preference inference and lists several advanced reasoning mechanisms in related work, the actual experiments are limited to basic reasoning chains. As a result, the interaction between preference inference and extended reasoning may not be deeply explored, offering limited novelty. 2. The method follows a standard SFT + GRPO pipeline using synthetic data generated from QwQ-32B. Given similar performance compared to off-the-shelf QwQ-
