Supervised Off-Policy Ranking

Yue Jin; Yue Zhang; Tao Qin; Xudong Zhang; Jian Yuan; Houqiang Li; Tie-Yan Liu

arXiv:2107.01360 · cs.LG · June 22, 2022 · 1 citation



TL;DR

This paper introduces supervised off-policy ranking (SOPR), a new problem setting in which candidate policies are ranked by supervised learning on off-policy data and on policies whose performance is already known, focusing on comparison rather than precise performance estimation.

Contribution

The paper proposes a hierarchical Transformer-based model for SOPR that learns to rank policies by minimizing a ranking loss, shifting from performance estimation to policy comparison.
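As a rough illustration of "learning to rank by minimizing a ranking loss", the sketch below trains a linear scorer on hypothetical per-policy feature vectors with a RankNet-style pairwise logistic loss. Both choices are assumptions: the linear scorer is a stand-in for the paper's hierarchical Transformer, and the pairwise logistic loss is one common ranking loss, not necessarily the exact one the paper uses. The names `train_scorer`, `pairwise_loss`, and all numbers are hypothetical.

```python
import math
from itertools import combinations

# Assumption: a RankNet-style pairwise logistic loss and a linear scorer stand in
# for the paper's ranking loss and hierarchical Transformer; all names are hypothetical.

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def ordered_pairs(perfs):
    # Yield (better, worse) index pairs; ties carry no ordering signal and are skipped.
    for i, j in combinations(range(len(perfs)), 2):
        if perfs[i] != perfs[j]:
            yield (i, j) if perfs[i] > perfs[j] else (j, i)

def pairwise_loss(w, feats, perfs):
    # Mean of log(1 + exp(-(s_better - s_worse))) over all ordered pairs.
    pairs = list(ordered_pairs(perfs))
    total = sum(softplus(-(score(w, feats[hi]) - score(w, feats[lo]))) for hi, lo in pairs)
    return total / max(len(pairs), 1)

def train_scorer(feats, perfs, lr=0.1, epochs=200):
    # Full-batch gradient descent on the pairwise loss.
    w = [0.0] * len(feats[0])
    pairs = list(ordered_pairs(perfs))
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for hi, lo in pairs:
            margin = score(w, feats[hi]) - score(w, feats[lo])
            coef = -sigmoid(-margin)  # derivative of softplus(-margin) w.r.t. margin
            for k in range(len(w)):
                grad[k] += coef * (feats[hi][k] - feats[lo][k])
        for k in range(len(w)):
            w[k] -= lr * grad[k] / max(len(pairs), 1)
    return w

# Hypothetical training policies: feature vectors and known performance.
feats = [[0.1, 1.0], [0.5, 0.2], [0.9, 0.4]]
perfs = [10.0, 20.0, 30.0]
w = train_scorer(feats, perfs)
ranking = sorted(range(len(feats)), key=lambda i: score(w, feats[i]), reverse=True)
print(ranking)  # → [2, 1, 0]
```

Because the loss only looks at which policy in each pair is better, the scorer never has to reproduce the actual performance values, which mirrors the shift from estimation to comparison described above.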

Findings

01 Outperforms baselines in rank correlation
02 Achieves lower regret values
03 Demonstrates improved stability
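The findings are stated in terms of rank correlation and regret. A minimal way to compute both for one set of candidate policies is sketched below, assuming Spearman's rank correlation and a regret@k defined as the best true performance overall minus the best true performance among the top-k policies by predicted score. These are common choices in the off-policy evaluation literature, but the paper's exact definitions may differ (Kendall's tau is another frequent rank-correlation metric). The score and return values are hypothetical.

```python
import math

# Assumption: Spearman's rho for rank correlation and a common regret@k definition;
# the paper's exact metric definitions may differ.

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(pred, true):
    # Pearson correlation between the two rank vectors.
    x, y = average_ranks(pred), average_ranks(true)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def regret_at_k(pred, true, k):
    # Best true performance overall minus best true performance among the top-k by predicted score.
    top_k = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:k]
    return max(true) - max(true[i] for i in top_k)

true_perf = [10.0, 20.0, 30.0, 40.0]  # hypothetical ground-truth performance
pred_score = [0.1, 0.3, 0.9, 0.2]     # hypothetical model scores
print(spearman(pred_score, true_perf))        # → 0.4
print(regret_at_k(pred_score, true_perf, 1))  # → 10.0
print(regret_at_k(pred_score, true_perf, 3))  # → 0.0
```

Both metrics depend only on the ordering of the predicted scores, so a ranking model can score well on them without calibrated performance estimates.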

Abstract

Off-policy evaluation (OPE) aims to evaluate a target policy with data generated by other policies. Most previous OPE methods focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or more candidate policies and choose a good one, which is a much simpler task than precisely evaluating their true performance; and (2) there are usually multiple policies that have been deployed to serve users in real-world systems and thus the true performance of these policies can be known. Inspired by these two observations, in this work, we study a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of target policies based on supervised learning by leveraging off-policy data and policies with known performance. We propose a method to solve SOPR, which learns a policy scoring model by…
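The abstract describes scoring target policies by leveraging off-policy data. One way to make that concrete, sketched here under stated assumptions, is to query each target policy on states drawn from the logged dataset and summarize the resulting state-action pairs as a fixed-size feature that a learned scorer can rank. Representing a policy this way is assumed for illustration; the mean pooling, the hand-set linear weights, and the names `policy_features` and `rank_policies` are hypothetical simplifications, not the paper's hierarchical Transformer.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a deterministic policy maps a state vector to an action vector, and
# mean pooling is a simplified stand-in for the paper's hierarchical Transformer encoder.
State = List[float]
Policy = Callable[[State], List[float]]

def policy_features(policy: Policy, states: Sequence[State]) -> List[float]:
    # Query the policy on logged (off-policy) states, then mean-pool the
    # concatenated (state, action) vectors into one fixed-size feature.
    pairs = [list(s) + list(policy(s)) for s in states]
    dim = len(pairs[0])
    return [sum(p[k] for p in pairs) / len(pairs) for k in range(dim)]

def rank_policies(policies: Dict[str, Policy], states: Sequence[State],
                  weights: Sequence[float]) -> List[str]:
    # Score each target policy with a (hypothetical, previously learned) linear scorer, best first.
    scores = {
        name: sum(w * f for w, f in zip(weights, policy_features(pi, states)))
        for name, pi in policies.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

logged_states = [[0.0], [1.0], [2.0]]  # hypothetical states from the off-policy dataset
targets = {
    "cautious": lambda s: [0.5 * s[0]],
    "aggressive": lambda s: [2.0 * s[0]],
    "idle": lambda s: [0.0],
}
print(policy_features(targets["cautious"], logged_states))        # → [1.0, 0.5]
print(rank_policies(targets, logged_states, weights=[0.0, 1.0]))  # → ['aggressive', 'cautious', 'idle']
```

Because every target policy is queried on the same logged states, the state half of each feature is identical across policies and differences come only from the actions they choose. Mean pooling also makes the feature independent of the order in which states are listed, a property one would want from any encoder of a set of state-action pairs.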


Code & Models

Repositories

SOPR-T/SOPR-T (PyTorch, official)


Taxonomy

Topics: Context-Aware Activity Recognition Systems · Topic Modeling · Network Packet Processing and Optimization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dropout · Label Smoothing