Online Learning and Equilibrium Computation with Ranking Feedback

Mingyang Liu; Yongshan Chen; Zhiyuan Fan; Gabriele Farina; Asuman Ozdaglar; Kaiqing Zhang

arXiv:2603.19221 · cs.LG · March 20, 2026

Open Access · 3 Reviews

TL;DR

This paper studies online learning with ranking feedback in place of numeric utility feedback. It establishes impossibility results and proposes algorithms that achieve sublinear regret under certain conditions, with applications to game theory and language models.

Contribution

It introduces new algorithms for online learning with ranking feedback, addressing the challenge of limited information and connecting to equilibrium computation in game theory.

Findings

01. Sublinear regret is impossible with instantaneous utility rankings in general.

02. Sublinear regret is also impossible under certain deterministic ranking models.

03. Proposed algorithms achieve sublinear regret when utility sequences have sublinear total variation.
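For readers who want the two quantities in the findings written out, the block below gives one common formalization. The notation is an assumption made for illustration, not taken from the paper: $K$ actions, a learner strategy $\pi_t$ in the simplex $\Delta^K$, a utility vector $u_t$, and the sup-norm as the measure of variation. The paper's own definitions (for example, its Assumption 4.2) may differ in norm or normalization.

```latex
% Assumed notation: K actions, \pi_t \in \Delta^K, u_t \in [0,1]^K.
% External regret after T rounds:
\mathrm{Reg}(T) \;=\; \max_{a \in [K]} \sum_{t=1}^{T} u_t(a) \;-\; \sum_{t=1}^{T} \langle \pi_t, u_t \rangle .
% Total variation of the utility sequence (sup-norm is an assumption):
V_T \;=\; \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert_\infty .
% "Sublinear" means \mathrm{Reg}(T) = o(T) and V_T = o(T), respectively.
```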

Abstract

Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on numeric utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a ranking over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the instantaneous utility at the current timestep, and rankings induced by the time-average utility up to the current timestep, under both full-information and bandit feedback settings. Using the standard external-regret metric, we show that sublinear…
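The difference between the two ranking mechanisms described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's model: it assumes a noise-free ranking over all K actions (a full-information style view), and the names `rank_actions` and `ranking_feedback` are hypothetical.

```python
import numpy as np

def rank_actions(scores):
    """Return action indices ordered from highest to lowest score (ties broken by index)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return [int(i) for i in order]

def ranking_feedback(utilities, mode="instantaneous"):
    """Yield one ranking per timestep for a (T, K) array of utilities.

    mode="instantaneous": rank by the current utility vector u_t.
    mode="time_average":  rank by the running mean (1/t) * sum_{s<=t} u_s.
    """
    if mode not in ("instantaneous", "time_average"):
        raise ValueError(f"unknown mode: {mode}")
    utilities = np.asarray(utilities, dtype=float)
    running_sum = np.zeros(utilities.shape[1])
    for t, u_t in enumerate(utilities, start=1):
        running_sum += u_t
        scores = u_t if mode == "instantaneous" else running_sum / t
        yield rank_actions(scores)

# Toy utility sequence: 3 timesteps, 3 actions (values are made up).
U = np.array([[1.0, 0.0, 0.5],
              [0.0, 0.9, 0.2],
              [0.0, 0.8, 0.1]])
inst = list(ranking_feedback(U, "instantaneous"))
avg = list(ranking_feedback(U, "time_average"))
print(inst)  # [[0, 2, 1], [1, 2, 0], [1, 2, 0]]
print(avg)   # [[0, 2, 1], [0, 1, 2], [1, 0, 2]]
```

In this toy sequence the instantaneous ranking flips as soon as the utilities change, while the time-average ranking lags behind because it carries the history of earlier rounds.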

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The paper cleanly formulates adversarial online learning with ranking feedback and explicitly distinguishes instantaneous vs. time-averaged ranking models and full-information vs. bandit settings.
* The authors provide matching-style hardness results and positive results that reveal the tradeoff between hardness and possibility.
* The paper shows how ranking-based no-regret dynamics imply convergence to approximate coarse correlated equilibria in general normal-form games (Theorems 7.2 and 7.3).
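The link between no-regret dynamics and coarse correlated equilibria (CCE) mentioned above rests on a standard fact: if every player's average external regret is at most ε, the empirical distribution of joint play is an ε-CCE. The sketch below checks that condition directly for a two-player normal-form game. The function name `cce_gap` and the restriction to two players are assumptions made for illustration; this is not the paper's procedure.

```python
import numpy as np

def cce_gap(payoffs, joint):
    """Largest expected gain any player gets by deviating to one fixed action.

    payoffs: pair of (m, n) arrays; payoffs[i][a, b] is player i's utility
             when player 1 plays a and player 2 plays b.
    joint:   (m, n) array, a probability distribution over action profiles.
    The distribution is an eps-CCE for every eps >= the returned value.
    """
    A, B = (np.asarray(p, dtype=float) for p in payoffs)
    joint = np.asarray(joint, dtype=float)
    col_marginal = joint.sum(axis=0)  # distribution of player 2's action
    row_marginal = joint.sum(axis=1)  # distribution of player 1's action
    # Best fixed deviation against the opponent's marginal, minus current value.
    gain_1 = (A @ col_marginal).max() - (A * joint).sum()
    gain_2 = (row_marginal @ B).max() - (B * joint).sum()
    return float(max(gain_1, gain_2))

# Matching pennies (zero-sum): uniform play is an exact CCE,
# while always playing profile (0, 0) leaves player 2 a gain of 2.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.full((2, 2), 0.25)
pure = np.array([[1.0, 0.0], [0.0, 0.0]])
gap_uniform = cce_gap((A, -A), uniform)
gap_pure = cce_gap((A, -A), pure)
print(gap_uniform, gap_pure)
```

To use it on learning dynamics, one would accumulate the empirical frequency of each action profile over T rounds, normalize it into `joint`, and track how the gap shrinks as T grows.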

Weaknesses

* Many positive results require the utility sequence to have sublinear variation (e.g., Assumption 4.2); these conditions are quite strong in fully adversarial environments.
* The hardness results for AvgUtil Rank hinge on $\tau$ being extremely small; the paper does not fully clarify how sharp these thresholds are.
* Experiments are only briefly mentioned and relegated to the appendix. From the main text, it is hard to see how the algorithms behave in practice.
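To see why the size of $\tau$ matters, it may help to look at a temperature-scaled Plackett-Luce (PL) sampler. The sketch assumes that $\tau$ is the temperature of a PL ranking model with weights $\exp(u/\tau)$, which is how the reviews read but is not confirmed by the text shown here. The function name `sample_pl_ranking` and the utility values are hypothetical.

```python
import numpy as np

def sample_pl_ranking(utilities, tau, rng):
    """Sample a ranking from a Plackett-Luce model with weights exp(u / tau).

    Actions are drawn one at a time without replacement; at each step the
    probability of picking an action is proportional to its weight.
    """
    u = np.asarray(utilities, dtype=float)
    remaining = list(range(len(u)))
    ranking = []
    while remaining:
        logits = u[remaining] / tau
        probs = np.exp(logits - logits.max())  # subtract max for numerical stability
        probs /= probs.sum()
        pick = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(int(pick)))
    return ranking

rng = np.random.default_rng(0)
u = [0.9, 0.5, 0.1]  # made-up utilities for three actions
sharp = [sample_pl_ranking(u, tau=1e-3, rng=rng) for _ in range(200)]
noisy = [sample_pl_ranking(u, tau=10.0, rng=rng) for _ in range(200)]
print(len({tuple(r) for r in sharp}), len({tuple(r) for r in noisy}))
```

Under this assumed model, a very small $\tau$ makes the sampled ranking effectively a deterministic sort of the utilities, while a large $\tau$ makes it close to uniformly random, so the ranking carries little information about the utility gaps.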

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The main strength of the paper is that it studies a natural and well-motivated problem that was hitherto unsolved. This problem has no shortage of consequential direct applications and is solvable without requiring fundamentally new machinery. That said, the paper is well written and has non-trivial analysis for both the upper- and lower-bound results. Further, the assumptions for the positive results are well justified by the lower-bound constructions.

Weaknesses

NA

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper is well written and easy to follow. The motivations are clear, and the ranking model (PL) is standard and popular. Various problem settings are considered, e.g., bandit, full information, and Nash equilibrium. The derivations are clear and the results are as expected. Solid and extensive work.

Weaknesses

The ranking model is somewhat limited, as having a consistent reward model in practice is very rare, e.g., in RLHF for LLMs. Is it possible to consider a problem where the reward function itself is sampled from a set of reward functions? This setting might be more realistic. It might be interesting to test inconsistent rewards in the simulation.

Taxonomy

Topics: Advanced Bandit Algorithms Research · Game Theory and Applications · Age of Information Optimization