InterPol: De-anonymizing LM Arena via Interpolated Preference Learning
Minsung Cho, Jaehyung Kim

TL;DR
This paper introduces INTERPOL, a novel model-driven framework that enhances de-anonymization of language models by learning deep stylistic patterns through interpolated preference data, revealing significant vulnerabilities in model anonymity.
Contribution
INTERPOL employs interpolated preference learning and adaptive curriculum strategies to improve model identification accuracy over existing methods.
Findings
INTERPOL outperforms baseline methods in identification accuracy.
It effectively captures deep stylistic patterns missed by statistical features.
Experiments demonstrate the real-world threat of ranking manipulation in LM Arena.
Abstract
Strict anonymity of model responses is key to the reliability of voting-based leaderboards such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of the vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines…
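To make the idea of hard negatives from model interpolation concrete, here is a minimal sketch. It assumes that "model interpolation" means a linear blend of two models' parameters, which the abstract does not confirm. The names `interpolate_weights`, `curriculum_alphas`, and the `Weights` type are hypothetical, and the fixed easy-to-hard schedule is a simple stand-in, not the adaptive curriculum the authors describe.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(target: Weights, other: Weights, alpha: float) -> Weights:
    """Blend two models per parameter: (1 - alpha) * target + alpha * other.

    Assumption: a small alpha gives a model close to the target, whose
    responses would act as harder negatives for an identifier.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if target.keys() != other.keys():
        raise ValueError("models must share the same parameter names")
    mixed: Weights = {}
    for name in target:
        t, o = target[name], other[name]
        if len(t) != len(o):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(t, o)]
    return mixed


def curriculum_alphas(num_stages: int, start: float = 1.0, end: float = 0.1) -> List[float]:
    """Fixed easy-to-hard schedule (not adaptive): alpha shrinks linearly
    from start to end, so negatives move closer to the target each stage."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if num_stages == 1:
        return [end]
    step = (start - end) / (num_stages - 1)
    return [start - i * step for i in range(num_stages)]


if __name__ == "__main__":
    target = {"layer.w": [0.0, 2.0]}
    other = {"layer.w": [2.0, 4.0]}
    for alpha in curriculum_alphas(3, start=1.0, end=0.0):
        print(alpha, interpolate_weights(target, other, alpha))
```

In a real pipeline the blended parameters would be loaded into a model that generates negative responses for training the identifier; that step is omitted here because the abstract gives no details about it.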
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Authorship Attribution and Profiling · Topic Modeling
