Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning

Ru Wang; Wei Huang; Qi Cao; Yusuke Iwasawa; Yutaka Matsuo; Jiaxian Guo

arXiv:2511.01191·cs.CL·March 3, 2026

Open Access 3 Reviews

TL;DR

Self-Harmony enhances test-time reinforcement learning by ensuring answer stability across paraphrased inputs, leading to state-of-the-art accuracy and robustness without human supervision.

Contribution

It introduces a novel framework that uses a single model in dual roles and a harmonic mean-based pseudo-label method to improve stability and performance in test-time RL.

Findings

01. Achieves state-of-the-art results on diverse benchmarks.

02. Demonstrates zero training failures across experiments.

03. Provides a robust, label-free test-time learning method.

Abstract

Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common…
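The aggregation idea described in the abstract can be illustrated with a minimal sketch. The excerpt does not give the paper's exact formula, so this assumes each candidate answer is scored by the harmonic mean of its relative frequency among answers to the original question and among answers to the reframed question. The function names (`answer_frequencies`, `harmonic_mean_label`, `majority_vote_label`) and the sample data are hypothetical and exist only for illustration.

```python
from collections import Counter


def answer_frequencies(answers):
    """Return the relative frequency of each distinct answer in a sample."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def harmonic_mean_label(original_answers, reframed_answers):
    """Pick the answer with the highest harmonic mean of its two frequencies.

    Assumption: an answer missing from either view scores 0, so only answers
    that appear under both phrasings can be selected.
    """
    p_orig = answer_frequencies(original_answers)
    p_ref = answer_frequencies(reframed_answers)
    scores = {}
    for answer in set(p_orig) | set(p_ref):
        a = p_orig.get(answer, 0.0)
        b = p_ref.get(answer, 0.0)
        scores[answer] = 2 * a * b / (a + b) if a + b > 0 else 0.0
    best = max(scores, key=scores.get)
    return best, scores


def majority_vote_label(answers):
    """Baseline: the most frequent answer within a single view."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical samples: "12" is popular on the original wording only,
# while "15" stays reasonably frequent under both wordings.
original = ["12"] * 5 + ["15"] * 3
reframed = ["12"] * 1 + ["15"] * 4 + ["9"] * 3

label, scores = harmonic_mean_label(original, reframed)
print(majority_vote_label(original), label)
```

With these made-up samples, majority voting on the original view selects "12", whereas the harmonic-mean score favors "15" because it is the only answer with substantial support in both views.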

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Well-written
- Clear and solid intuition
- Theoretically well-grounded
- Extensive experiments and benchmarks, as well as publicly released anonymized code. Ablations provide a complete understanding of each component of the proposed algorithm and clearly show its superiority.

Weaknesses

- The authors mention that the reward design plays a role in obtaining SOTA performance (Section 4.3), but the only discussion of how the reward weights ($w_f, w_d$) are set is in Appendix E.4, and only in passing: "weights set to match the proportional influence ...". This design choice should be elaborated further.
- Although there isn't anything (significantly) wrong with the theoretical discussions, I believe there are some parts that warrant further dis

Reviewer 02 · Rating 4 · Confidence 3

Strengths

This paper stands out for its clarity and strong empirical results. The idea of enforcing consistency across paraphrased versions of a question is simple yet powerful—it directly targets a core weakness of existing test-time RL methods that overfit to their own biases. Using a single model in dual roles (Solver and Reframer) is elegant and avoids the need for extra models or supervision, making the approach practical and scalable. The harmonic mean pseudo-label rule is theoretically grounded and

Weaknesses

The harmonic mean rule, while supported by a theoretical derivation, still feels somewhat heuristic in how it’s applied. In practice, the paper does not deeply analyze why the harmonic mean performs better than simpler alternatives like averaging or weighted voting, or under what conditions it might fail. For instance, if the paraphrasing step inadvertently shifts the semantics of the question—introducing small wording biases or contextual cues—the harmonic mean could incorrectly down-weight val
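A small numeric comparison may help make the reviewer's question concrete. This is only a sketch with hypothetical frequencies, not an analysis from the paper: it contrasts the arithmetic and harmonic means for an answer whose support is lopsided across the two views and an answer whose support is balanced.

```python
def arithmetic_mean(a, b):
    """Plain average of the two view frequencies."""
    return (a + b) / 2


def harmonic_mean(a, b):
    """Harmonic mean of the two view frequencies; 0 if both are 0."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


# Hypothetical (original-view, reframed-view) frequencies for two answers.
lopsided = (0.9, 0.1)  # popular on one wording, rare on the other
stable = (0.5, 0.5)    # equally supported under both wordings

for name, (a, b) in {"lopsided": lopsided, "stable": stable}.items():
    print(name, arithmetic_mean(a, b), harmonic_mean(a, b))
```

Under simple averaging the two answers tie, while the harmonic mean penalizes the lopsided one heavily. The same property underlies the failure mode the reviewer raises: if a paraphrase shifts the question's meaning, a valid answer could end up with lopsided support and be down-weighted.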

Reviewer 03 · Rating 6 · Confidence 3

Strengths

**Soundness**: The proposed method is intuitive and straightforward to implement. It builds on a sound intuition: a correct answer should remain consistent across semantically equivalent but stylistically distinct formulations of the same question. The paper also provides a comparative analysis of pseudo-label quality between Self-Harmony, TTRL, and Co-Reward, demonstrating that Self-Harmony produces the highest-quality pseudo labels.

**Substance**: The experiments are comprehensive, cove

Weaknesses

**Missing analysis**: It would strengthen the paper to include a deeper analysis of why Self-Harmony improves pseudo-label quality. For instance, one could perform a simple inference-based categorization: for each question, sample multiple responses and classify questions into four types based on model confidence and correctness:
- Confident (majority answer ≥ N/2 votes) and correct
- Not confident (majority answer < N/2 votes) but correct
- Confident and incorrect
- Not confident and incorr
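The categorization the reviewer proposes could be sketched roughly as follows. This assumes that "correct" means the majority answer matches a gold label and that "confident" means the majority answer receives at least N/2 of the N sampled votes. The function name `categorize` and the example inputs are hypothetical.

```python
from collections import Counter


def categorize(sampled_answers, gold_answer):
    """Classify one question by majority-vote confidence and correctness."""
    counts = Counter(sampled_answers)
    majority_answer, votes = counts.most_common(1)[0]
    # Confident: the majority answer holds at least half of the N samples.
    confident = votes >= len(sampled_answers) / 2
    # Assumption: correctness is judged on the majority answer only.
    correct = majority_answer == gold_answer
    confidence_part = "confident" if confident else "not confident"
    correctness_part = "correct" if correct else "incorrect"
    return f"{confidence_part} and {correctness_part}"


# Hypothetical samples for three questions whose gold answer is "4".
print(categorize(["4", "4", "4", "5"], "4"))
print(categorize(["4", "5", "6", "4", "7"], "4"))
print(categorize(["5", "5", "5", "4"], "4"))
```

Counting how many questions fall into each category before and after adaptation would show which kinds of questions a pseudo-labeling rule helps or hurts.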

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Topic Modeling