InterPol: De-anonymizing LM Arena via Interpolated Preference Learning

Minsung Cho; Jaehyung Kim

arXiv:2603.15220 · cs.AI · March 17, 2026

Open Access

TL;DR

This paper introduces INTERPOL, a model-driven framework for de-anonymizing language models on voting-based leaderboards. It learns deep stylistic patterns from interpolated preference data, exposing significant weaknesses in model anonymity.

Contribution

INTERPOL employs interpolated preference learning and adaptive curriculum strategies to improve model identification accuracy over existing methods.

Findings

01. INTERPOL outperforms baseline methods in identification accuracy.

02. It effectively captures deep stylistic patterns missed by statistical features.

03. Experiments demonstrate the real-world threat of ranking manipulation in LM Arena.

Abstract

Strict anonymity of model responses is key to the reliability of voting-based leaderboards, such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of the vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines…
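
The abstract does not say how the interpolation or the curriculum is implemented. The sketch below is illustrative only and is not the authors' method. It assumes that "model interpolation" means a linear blend of two models' parameters, and that the curriculum raises difficulty by moving the blended model closer to the target once the identifier is accurate enough. The names `interpolate_params` and `next_alpha`, the threshold, and the step size are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for model weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def interpolate_params(target: Params, other: Params, alpha: float) -> Params:
    """Linearly blend two parameter sets (an assumed reading of "model interpolation").

    alpha = 1.0 returns the target's parameters, alpha = 0.0 the other model's.
    Values of alpha near 1.0 give a model close to the target, which is where
    harder negatives would come from under this assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if target.keys() != other.keys():
        raise ValueError("parameter names must match")
    blended: Params = {}
    for name in target:
        if len(target[name]) != len(other[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [
            alpha * t + (1.0 - alpha) * o
            for t, o in zip(target[name], other[name])
        ]
    return blended


def next_alpha(
    alpha: float,
    accuracy: float,
    threshold: float = 0.8,   # hypothetical accuracy needed to advance
    step: float = 0.1,        # hypothetical difficulty increment
    max_alpha: float = 0.9,   # stay below 1.0 so negatives never equal the target
) -> float:
    """Hypothetical curriculum rule: raise difficulty only after the identifier
    reaches the accuracy threshold at the current difficulty."""
    if accuracy >= threshold:
        return min(max_alpha, alpha + step)
    return alpha
```

In this reading, a response from the target model would be paired with a response from the blended model to form one preference example, and `alpha` would be updated between training rounds. Whether the paper builds its pairs or its schedule this way cannot be determined from the abstract.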

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Authorship Attribution and Profiling · Topic Modeling